6.1. Exploring the Data

      In this project, we're going to work with data from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF). The SCF is a survey sponsored by the US Federal Reserve. It tracks financial, demographic, and opinion information about families in the United States. The survey is conducted every three years, and we'll work with an extract of the results from 2019.

[1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import wqet_grader
from IPython.display import VimeoVideo

wqet_grader.init("Project 6 Assessment")

[2]:
VimeoVideo("710780578", h="43bb879d16", width=600)

1. Prepare Data

1.1. Import

First, we need to load the data, which is stored in a compressed CSV file: `SCFP2019.csv.gz`. In the last project, you learned how to decompress files using `gzip` and the command line. However, pandas' `read_csv` function can work with compressed files directly.

[ ]:
VimeoVideo("710781788", h="efd2dda882", width=600)

      **Task 6.1.1:** Read the file `"data/SCFP2019.csv.gz"` into the DataFrame `df`.

      • Read a CSV file into a DataFrame using pandas.
[ ]:
df = ...
print("df shape:", df.shape)
df.head()

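The task cell above is left for you to complete. As a minimal sketch of the general pattern (using a tiny invented file, not the real SCF extract), pandas' `read_csv` can open a gzip-compressed CSV without any manual decompression:

```python
import gzip

import pandas as pd

# Write a tiny stand-in file (invented data, not the real SCF extract)
with gzip.open("tiny.csv.gz", "wt") as f:
    f.write("TURNFEAR,AGE\n1,34\n0,51\n")

# `compression` defaults to "infer", so the .gz suffix is detected automatically
df_demo = pd.read_csv("tiny.csv.gz")
print("df_demo shape:", df_demo.shape)
```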
      One of the first things you might notice here is that this dataset is HUGE — over 20,000 rows and 351 columns! SO MUCH DATA!!! We won't have time to explore all of the features in this dataset, but you can look in the [data dictionary](./066-data-dictionary.ipynb) for this project for details and links to the official [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbk.htm). For now, let's just say that this dataset tracks all sorts of behaviors relating to the ways households earn, save, and spend money in the United States.

      For this project, we're going to focus on households that have "been turned down for credit or feared being denied credit in the past 5 years." These households are identified in the "TURNFEAR" column.

[ ]:
VimeoVideo("710783015", h="c24ce96aab", width=600)

**Task 6.1.2:** Use a mask to subset `df` to include only households that have been turned down or feared being turned down for credit (`"TURNFEAR" == 1`). Assign this subset to the variable name `df_fear`.

      • Subset a DataFrame with a mask using pandas.
[ ]:
mask = ...
df_fear = ...
print("df_fear shape:", df_fear.shape)
df_fear.head()
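A sketch of the boolean-mask pattern on a toy DataFrame (the column names mirror the SCF data, but the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({"TURNFEAR": [1, 0, 1], "AGE": [34, 51, 29]})  # invented

mask = toy["TURNFEAR"] == 1  # boolean Series: True where the condition holds
toy_fear = toy[mask]         # keeps only the rows where the mask is True

print("toy_fear shape:", toy_fear.shape)
```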
1.2. Explore

1.2.1. Age

      Now that we have our subset, let's explore the characteristics of this group. One of the features is age group (`"AGECL"`).

[ ]:
VimeoVideo("710784794", h="71b10e363d", width=600)

      **Task 6.1.3:** Create a list `age_groups` with the unique values in the `"AGECL"` column. Then review the entry for `"AGECL"` in the [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbkfx0.htm) to determine what the values represent.

      • Determine the unique values in a column using pandas.
[ ]:
age_groups = ...
print("Age Groups:", age_groups)

      Looking at the Code Book we can see that `"AGECL"` represents categorical data, even though the values in the column are numeric.

*[Image: Code Book entry for `"AGECL"`]*

      This simplifies data storage, but it's not very human-readable. So before we create a visualization, let's create a version of this column that uses the actual group names.

[ ]:
VimeoVideo("710785566", h="f0fafd3a29", width=600)

**Task 6.1.4:** Create a Series `age_cl` that contains the observations from `"AGECL"`, using the true group names.

      • Create a Series in pandas.
      • Replace values in a column using pandas.
[ ]:
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}

age_cl = ...
age_cl.head()
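One way to map codes to labels is `Series.replace` with a dictionary, sketched here on invented values:

```python
import pandas as pd

codes = pd.Series([1, 2, 6, 1])  # invented AGECL-style codes
labels = codes.replace({1: "Under 35", 2: "35-44", 6: "75 or Older"})
print(labels.tolist())
```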
      Now that we have better labels, let's make a bar chart and see the age distribution of our group.

[ ]:
VimeoVideo("710840376", h="d43825c14b", width=600)

      **Task 6.1.5:** Create a bar chart showing the value counts from `age_cl`. Be sure to label the x-axis `"Age Group"`, the y-axis `"Frequency (count)"`, and use the title `"Credit Fearful: Age Groups"`.

      • Create a bar chart using pandas.
[ ]:
age_cl_value_counts = ...

# Bar plot of `age_cl_value_counts`
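The usual pattern is `value_counts` followed by the pandas `.plot` accessor; here is a sketch on invented labels (the Agg backend keeps it runnable without a display):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

groups = pd.Series(["Under 35", "Under 35", "35-44"])  # invented observations
counts = groups.value_counts()

ax = counts.plot(kind="bar")
ax.set_xlabel("Age Group")
ax.set_ylabel("Frequency (count)")
ax.set_title("Credit Fearful: Age Groups")
plt.tight_layout()
```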
You might have noticed that by creating their own age groups, the authors of the survey have essentially made a six-bin histogram for us. Our chart tells us that many of the people who fear being denied credit are younger. But the first two age groups cover a wider range than the other four, so it might be useful to look inside those values to get a more granular understanding of the data.

To do that, we'll need to look at a different variable: `"AGE"`. Whereas `"AGECL"` was a categorical variable, `"AGE"` is continuous, so we can use it to make a histogram of our own.

[ ]:
VimeoVideo("710841580", h="a146a24e5c", width=600)

      **Task 6.1.6:** Create a histogram of the `"AGE"` column with 10 bins. Be sure to label the x-axis `"Age"`, the y-axis `"Frequency (count)"`, and use the title `"Credit Fearful: Age Distribution"`. 

      • Create a histogram using pandas.
[ ]:
# Plot histogram of "AGE"
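A sketch of the histogram call on invented ages:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import pandas as pd

ages = pd.Series([25, 31, 33, 38, 41, 57, 64])  # invented ages
ax = ages.plot(kind="hist", bins=10)
ax.set_xlabel("Age")
ax.set_ylabel("Frequency (count)")
ax.set_title("Credit Fearful: Age Distribution")
```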
      It looks like younger people are still more concerned about being able to secure a loan than older people, but the people who are *most* concerned seem to be between 30 and 40. 

1.2.2. Race

      Now that we have an understanding of how age relates to our outcome of interest, let's try some other possibilities, starting with race. If we look at the [Code Book](https://sda.berkeley.edu/sdaweb/docs/scfcomb2019/DOC/hcbk0001.htm#RACE) for `"RACE"`, we can see that there are 4 categories.

*[Image: Code Book entry for `"RACE"`]*

Note that there's no category 4 here. If a value for 4 did exist, it would be reasonable to assign it to "Asian American / Pacific Islander" — a group that doesn't seem to be represented in the dataset. This is a strange omission, but you'll often find that large public datasets have these sorts of issues. The important thing is to always read the data dictionary carefully. In this case, remember that this dataset doesn't provide a complete picture of race in America — something that you'd have to explain to anyone interested in your analysis.

[ ]:
VimeoVideo("710842177", h="8d8354e091", width=600)

      **Task 6.1.7:** Create a horizontal bar chart showing the normalized value counts for `"RACE"`. In your chart, you should replace the numerical values with the true group names. Be sure to label the x-axis `"Frequency (%)"`, the y-axis `"Race"`, and use the title `"Credit Fearful: Racial Groups"`. Finally, set the `xlim` for this plot to `(0,1)`.

      • Create a bar chart using pandas.
[ ]:
race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}
race = ...
race_value_counts = ...
# Create bar chart of `race_value_counts`

plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("Credit Fearful: Racial Groups");
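Passing `normalize=True` to `value_counts` returns proportions instead of raw counts; a sketch on invented codes:

```python
import pandas as pd

race_dict = {1: "White/Non-Hispanic", 2: "Black/African-American", 5: "Other"}
codes = pd.Series([1, 1, 2, 5])  # invented RACE-style codes

# Replace codes with labels, then get each group's share of the total
shares = codes.replace(race_dict).value_counts(normalize=True)
print(shares["White/Non-Hispanic"])
```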
This suggests that White/Non-Hispanic people worry more about being denied credit. But thinking critically about what we're seeing, that might simply be because there are more White/Non-Hispanic people in the United States than there are people from other racial groups, and the sample for this survey was drawn to be representative of the population as a whole.

[ ]:
VimeoVideo("710844376", h="8e1fdf92ef", width=600)

**Task 6.1.8:** Recreate the horizontal bar chart you just made, but this time use the entire dataset `df` instead of the subset `df_fear`. The title of this plot should be `"SCF Respondents: Racial Groups"`.

      • Create a bar chart using pandas.
[ ]:
race = ...
race_value_counts = ...
# Create bar chart of `race_value_counts`

plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("SCF Respondents: Racial Groups");

How does this second bar chart change our perception of the first one? On the one hand, we can see that White Non-Hispanics account for around 70% of the whole dataset, but only 54% of credit-fearful respondents. On the other hand, Black and Hispanic respondents represent 23% of the whole dataset but 40% of credit-fearful respondents. In other words, Black and Hispanic households are actually *more* likely to be in the credit-fearful group.

**Data Ethics:** It's important to note that segmenting customers by race (or any other demographic group) for the purpose of lending is illegal in the United States. The same thing might be legal elsewhere, but even if it is, making decisions about things like lending based on racial categories is clearly unethical. This is a great example of how easy it can be to use data science tools to support and propagate systems of inequality. Even though we're "just" using numbers, statistical analysis is never neutral, so we always need to think critically about how our work will be interpreted by the end user.

1.2.3. Income

      What about income level? Are people with lower incomes concerned about being denied credit, or is that something people with more money worry about? In order to answer that question, we'll need to again compare the entire dataset with our subgroup using the `"INCCAT"` feature, which captures income percentile groups. This time, though, we'll make a single, side-by-side bar chart.

*[Image: Code Book entry for `"INCCAT"`]*

[ ]:
VimeoVideo("710849451", h="34a367a3f9", width=600)

      **Task 6.1.9:** Create a DataFrame `df_inccat` that shows the normalized frequency for income categories for both the credit fearful and non-credit fearful households in the dataset. Your final DataFrame should look something like this:

          TURNFEAR   INCCAT  frequency
      0          0   90-100   0.297296
      1          0  60-79.9   0.174841
      2          0  40-59.9   0.143146
      3          0     0-20   0.140343
      4          0  21-39.9   0.135933
      5          0  80-89.9   0.108441
      6          1     0-20   0.288125
      7          1  21-39.9   0.256327
      8          1  40-59.9   0.228856
      9          1  60-79.9   0.132598
      10         1   90-100   0.048886
      11         1  80-89.9   0.045209
      
      • Aggregate data in a Series using value_counts in pandas.
      • Aggregate data using the groupby method in pandas.
      • Create a Series in pandas.
      • Rename a Series in pandas.
      • Replace values in a column using pandas.
      • Set and reset the index of a DataFrame in pandas.
[ ]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = ...

df_inccat
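One common recipe for this kind of table chains the operations listed in the helpers above; a sketch on a four-row invented frame (not the real SCF data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "TURNFEAR": [0, 0, 1, 1],
        "INCCAT": ["90-100", "0-20", "0-20", "0-20"],
    }
)  # invented rows

freq = (
    toy.groupby("TURNFEAR")["INCCAT"]
    .value_counts(normalize=True)  # per-group proportions
    .rename("frequency")           # avoid a name clash with the index level
    .reset_index()                 # back to a flat DataFrame
)
print(freq)
```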
[ ]:
VimeoVideo("710852691", h="3dcbf24a68", width=600)

**Task 6.1.10:** Using seaborn, create a side-by-side bar chart of `df_inccat`. Set `hue` to `"TURNFEAR"`, and make sure that the income categories are in the correct order along the x-axis. Label the x-axis `"Income Category"`, the y-axis `"Frequency (%)"`, and use the title `"Income Distribution: Credit Fearful vs. Non-fearful"`.

      • Create a bar chart using seaborn.
[ ]:
# Create bar chart of `df_inccat`

plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Credit Fearful vs. Non-fearful");

      Comparing the income categories across the fearful and non-fearful groups, we can see that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes. 

So, based on all this, what do we know? Among the people who responded that they were indeed worried about being approved for credit after having been denied in the past five years, the young and the low-income made up the largest share. That makes sense, right? Young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.

1.2.4. Assets

      Not all the data is demographic, though. If you were working for a bank, you would probably care less about how old the people are, and more about their ability to carry more debt. If we were going to build a model for that, we'd want to establish some relationships among the variables, and making some correlation matrices is a good place to start.

First, let's zoom out a little bit. We've been looking at only the people who answered "yes" when the survey asked about `"TURNFEAR"`, but what if we looked at everyone instead? To begin with, let's go back to the full dataset and run a single correlation.

[ ]:
VimeoVideo("710856200", h="7b06e8b7f2", width=600)

      **Task 6.1.11:** Calculate the correlation coefficient for `"ASSET"` and `"HOUSES"` in the whole dataset `df`.

      • Calculate the correlation coefficient for two Series using pandas.
[ ]:
asset_house_corr = ...
print("SCF: Asset Houses Correlation:", asset_house_corr)
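`Series.corr` computes the Pearson coefficient by default; a sketch on invented numbers:

```python
import pandas as pd

toy = pd.DataFrame(
    {"ASSET": [100, 200, 300], "HOUSES": [80, 150, 310]}  # invented values
)
r = toy["ASSET"].corr(toy["HOUSES"])  # Pearson correlation by default
print(round(r, 3))
```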
That's a moderate positive correlation, which we would probably expect, right? For many Americans, the value of their primary residence makes up most of the value of their total assets. What about the people in our `TURNFEAR` subset, though? Let's run that correlation to see if there's a difference.

[ ]:
VimeoVideo("710857088", h="33b8f810fb", width=600)

**Task 6.1.12:** Calculate the correlation coefficient for `"ASSET"` and `"HOUSES"` in the credit-fearful subset `df_fear`.

      • Calculate the correlation coefficient for two Series using pandas.
[ ]:
asset_house_corr = ...
print("Credit Fearful: Asset Houses Correlation:", asset_house_corr)
Aha! They're different! It's still only a moderate positive correlation, but the relationship between the total value of assets and the value of the primary residence is stronger for our `TURNFEAR` group than it is for the population as a whole.

Let's make correlation matrices using the rest of the data for both `df` and `df_fear` and see if the differences persist. Here, we'll look at only 5 features: `"ASSET"`, `"HOUSES"`, `"INCOME"`, `"DEBT"`, and `"EDUC"`.

[ ]:
VimeoVideo("710857545", h="c67691d13e", width=600)

      **Task 6.1.13:** Make a correlation matrix using `df`, considering only the columns `"ASSET"`, `"HOUSES"`, `"INCOME"`, `"DEBT"`, and `"EDUC"`.

      • Create a correlation matrix in pandas.
[ ]:
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = ...
corr.style.background_gradient(axis=None)
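`DataFrame.corr` on a column subset returns a square matrix with 1.0 on the diagonal; a sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ASSET": [100, 200, 300, 250],
        "HOUSES": [80, 150, 310, 200],
        "DEBT": [40, 60, 290, 150],
    }
)  # invented values

# Pairwise Pearson correlations among the selected columns
corr_demo = toy[["ASSET", "HOUSES", "DEBT"]].corr()
print(corr_demo.shape)
```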
[ ]:
wqet_grader.grade("Project 6 Assessment", "Task 6.1.13", corr)

[ ]:
VimeoVideo("710858210", h="b679fd1fa5", width=600)

      **Task 6.1.14:** Make a correlation matrix using `df_fear`.

      • Create a correlation matrix in pandas.
[ ]:
corr = ...
corr.style.background_gradient(axis=None)

      Whoa! There are some pretty important differences here! The relationship between `"DEBT"` and `"HOUSES"` is positive for both datasets, but while the coefficient for `df` is fairly weak at 0.26, the same number for `df_fear` is 0.96. 

Remember, the closer a correlation coefficient is to 1.0, the more closely the two variables move together. In this case, that means the value of the primary residence and the total debt held by the household are getting pretty close to being the same. This suggests that the main source of debt for our `"TURNFEAR"` folks is their primary residence, which, again, is an intuitive finding.

`"DEBT"` and `"ASSET"` show a similarly striking difference, as do `"EDUC"` and `"DEBT"`, which, while not as extreme a contrast, is still big enough to catch the interest of our hypothetical banker.

      Let's make some visualizations to show these relationships graphically.

1.2.5. Education

First, let's look at education levels (`"EDUC"`), comparing the credit-fearful and non-credit-fearful groups.

![](../images/6.1.15.png)

[ ]:
VimeoVideo("710858769", h="2e6596cd4b", width=600)

      **Task 6.1.15:** Create a DataFrame `df_educ` that shows the normalized frequency for education categories for both the credit fearful and non-credit fearful households in the dataset. This will be similar in format to `df_inccat`, but focus on education. **Note** that you don't need to replace the numerical values in `"EDUC"` with the true labels.

          TURNFEAR  EDUC  frequency
      0          0    12   0.257481
      1          0     8   0.192029
      2          0    13   0.149823
      3          0     9   0.129833
      4          0    14   0.096117
      5          0    10   0.051150
      ...
      25         1     5   0.015358
      26         1     2   0.012979
      27         1     3   0.011897
      28         1     1   0.005408
      29         1    -1   0.003245
      
      • Aggregate data in a Series using value_counts in pandas.
      • Aggregate data using the groupby method in pandas.
      • Create a Series in pandas.
      • Rename a Series in pandas.
      • Replace values in a column using pandas.
      • Set and reset the index of a DataFrame in pandas.
[ ]:
df_educ = ...
df_educ.head()

[ ]:
VimeoVideo("710861978", h="81349c4b6a", width=600)

**Task 6.1.16:** Using seaborn, create a side-by-side bar chart of `df_educ`. Set `hue` to `"TURNFEAR"`, and make sure that the education categories are in the correct order along the x-axis. Label the x-axis `"Education Level"`, the y-axis `"Frequency (%)"`, and use the title `"Educational Attainment: Credit Fearful vs. Non-fearful"`.

      • Create a bar chart using seaborn.
[ ]:
# Create bar chart of `df_educ`

plt.xlabel("Education Level")
plt.ylabel("Frequency (%)")
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful");

      In this plot, we can see that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit fearful.

1.2.6. Debt

      Let's keep going with some scatter plots that look at debt.

[ ]:
VimeoVideo("710862939", h="0f6e0fc201", width=600)

      **Task 6.1.17:** Use `df` to make a scatter plot showing the relationship between `DEBT` and `ASSET`.

      • Create a scatter plot with pandas.
[ ]:
# Create scatter plot of ASSET vs DEBT, df

[ ]:
VimeoVideo("710864442", h="2428f1c168", width=600)

      **Task 6.1.18:** Use `df_fear` to make a scatter plot showing the relationship between `DEBT` and `ASSET`.

      • Create a scatter plot with pandas.
[ ]:
# Create scatter plot of ASSET vs DEBT, df_fear
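The pandas scatter call looks like this on an invented frame (pandas labels the axes with the column names automatically):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import pandas as pd

toy = pd.DataFrame({"DEBT": [40, 60, 290], "ASSET": [100, 200, 300]})  # invented
ax = toy.plot(kind="scatter", x="DEBT", y="ASSET")
print(ax.get_xlabel(), ax.get_ylabel())
```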
You can see that the relationship in our `df_fear` graph is flatter than the one in our `df` graph; the two are clearly different.

Let's end with the most striking difference from our matrices, and make some scatter plots showing the relationship between `HOUSES` and `DEBT`.

      [ ]:
VimeoVideo("710865281", h="2e9fc0d9b9", width=600)
      **Task 6.1.19:** Use `df` to make a scatter plot showing the relationship between `HOUSES` and `DEBT`.

      • Create a scatter plot with pandas.
      [ ]:
# Create scatter plot of HOUSES vs DEBT, df
df.plot.scatter(x="DEBT", y="HOUSES")
      And make the same scatter plot using `df_fear`. 

      [ ]:
VimeoVideo("710870286", h="3cd177965a", width=600)
      **Task 6.1.20:** Use `df_fear` to make a scatter plot showing the relationship between `HOUSES` and `DEBT`.

      • Create a scatter plot with pandas.
      [ ]:
# Create scatter plot of HOUSES vs DEBT, df_fear
df_fear.plot.scatter(x="DEBT", y="HOUSES")
      The outliers make it a little difficult to see the difference between these two plots, but the relationship is clear enough: our `df_fear` graph shows an almost perfect linear relationship, while our `df` graph shows something a little more muddled. You might also notice that the datapoints on the `df_fear` graph form several little groups. Those are called "clusters," and we'll be talking more about how to analyze clustered data in the next lesson.

      ---

      Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


      Usage Guidelines

      This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

      This means:

      • ⓧ No downloading this notebook.
      • ⓧ No re-sharing of this notebook with friends or colleagues.
      • ⓧ No downloading the embedded videos in this notebook.
      • ⓧ No re-sharing embedded videos with friends or colleagues.
      • ⓧ No adding this notebook to public or private repositories.
      • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>6.2. Clustering with Two Features</strong></font>

In the previous lesson, you explored data from the [Survey of Consumer Finances](https://www.federalreserve.gov/econres/scfindex.htm) (SCF), paying special attention to households that have been turned down for credit or feared being denied credit. In this lesson, we'll build a model to segment those households into distinct clusters, and examine the differences between those clusters.

      [1]:
      import matplotlib.pyplot as plt
      import pandas as pd
      import seaborn as sns
      import wqet_grader
      from IPython.display import VimeoVideo
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score
from teaching_tools.widgets import ClusterWidget, SCFClusterWidget

wqet_grader.init("Project 6 Assessment")
[2]:
VimeoVideo("713919442", h="7b4cbc1495", width=600)
      # Prepare Data

      ## Import


      Just like always, we need to begin by bringing our data into the project. We spent some time in the previous lesson working with a subset of the larger SCF dataset called "TURNFEAR". Let's start with that.

[3]:
VimeoVideo("713919411", h="fd4fae4013", width=600)
      **Task 6.2.1:** Create a `wrangle` function that takes a path of a CSV file as input, reads the file into a DataFrame, subsets the data to households that have been turned down for credit or feared being denied credit in the past 5 years (see `"TURNFEAR"`), and returns the subset DataFrame. 

      • Write a function in Python.
      • Subset a DataFrame by selecting one or more columns in pandas.
      [7]:
def wrangle(filepath):
    # Read the SCF data into a DataFrame
    df = pd.read_csv(filepath)
    # Keep only households that were turned down for credit or feared being denied
    mask = df["TURNFEAR"] == 1
    df = df[mask]
    return df
      And now that we've got that taken care of, we'll import the data and see what we've got.


**Task 6.2.2:** Use your `wrangle` function to read the file `SCFP2019.csv.gz` into a DataFrame named `df`.

      • Read a CSV file into a DataFrame using pandas.
      [8]:
      df = wrangle("data/SCFP2019.csv.gz")
      print(df.shape)
      df.head()
      (4623, 351)
      
      [8]:
      YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
      5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
      7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

      5 rows × 351 columns

## Explore

We looked at a lot of different features of the `"TURNFEAR"` subset in the last lesson, and the last thing we looked at was the relationship between real estate and debt. To refresh our memory on what that relationship looked like, let's make that graph again.

[9]:
VimeoVideo("713919351", h="55dc979d55", width=600)
**Task 6.2.3:** Create a scatter plot that shows the total value of a household's primary residence (`"HOUSES"`) as a function of the total value of household debt (`"DEBT"`). Be sure to label your x-axis as `"Household Debt"`, your y-axis as `"Home Value"`, and use the title `"Credit Fearful: Home Value vs. Household Debt"`.

      • What's a scatter plot?
      • Create a scatter plot using seaborn.
      [12]:
      # Plot "HOUSES" vs "DEBT"
      sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6)
      plt.xlabel("Household Debt [$1M]")
      plt.ylabel("Home Value [$1M]")
      plt.title("Credit Fearful: Home Value vs. Household Debt");
Remember that graph and its clusters? Let's get a little deeper into it.

## Split

We need to split our data, but we're not going to need a target vector or a test set this time around. That's because the model we'll be building involves *unsupervised* learning. It's called *unsupervised* because the model doesn't try to map input to a set of labels or targets that already exist. It's kind of like how humans learn new skills, in that we don't always have models to copy. Sometimes, we just try something and see what happens. Keep in mind that this doesn't make these models any less useful, it just makes them different.

      So, keeping that in mind, let's do the split.

[13]:
VimeoVideo("713919336", h="775867f48a", width=600)
      **Task 6.2.4:** Create the feature matrix `X`. It should contain two features only: `"DEBT"` and `"HOUSES"`.


      • What's a feature matrix?
      • Subset a DataFrame by selecting one or more columns in pandas.
      [15]:
      X = df[["HOUSES", "DEBT"]]
      print(X.shape)
      X.head()
      (4623, 2)
      
      [15]:
      HOUSES DEBT
      5 0.0 12200.0
      6 0.0 12600.0
      7 0.0 15300.0
      8 0.0 14100.0
      9 0.0 15400.0
# Build Model

Before we start building the model, let's take a second to talk about something called `KMeans`.

Take another look at the scatter plot we made at the beginning of this lesson. Remember how the datapoints form little clusters? It turns out we can use the k-means algorithm to partition the dataset into smaller groups like these.

      Let's take a look at how those things work together.

[16]:
VimeoVideo("713919214", h="028502efe7", width=600)
      **Task 6.2.5:** Run the cell below to display the `ClusterWidget`.


      • What's a centroid?
      • What's a cluster?
      [17]:
      cw = ClusterWidget(n_clusters=3)
      cw.show()
      VBox(children=(IntSlider(value=0, continuous_update=False, description='Step:', max=10), Output(layout=Layout(…
Take a second and run slowly through all the positions on the slider. At the first position, there's a whole bunch of gray datapoints, and if you look carefully, you'll see there are also three stars. Those stars are the **centroids**. At first, their position is set randomly. If you move the slider one more position to the right, you'll see all the gray points change to colors that correspond to three clusters.

      Since a centroid represents the mean value of all the data in the cluster, we would expect it to fall in the center of whatever cluster it's in. That's what will happen if you move the slider one more position to the right. See how the centroids moved?

      Aha! But since they moved, the datapoints might not be in the right clusters anymore. Move the slider again, and you'll see the data points redistribute themselves to better reflect the new position of the centroids. The new clusters mean that the centroids also need to move, which will lead to the clusters changing again, and so on, until all the datapoints end up in the right cluster with a centroid that reflects the mean value of all those points.
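The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation; the toy data and the corner-based starting centroids are assumptions made for this example (the real algorithm starts from random positions).

```python
import numpy as np

# Toy 2-D data: two well-separated groups of points
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

# Start the two centroids at opposite corners of the data
# (in the real algorithm, the starting positions are random)
k = 2
centroids = np.array([X_toy.min(axis=0), X_toy.max(axis=0)])

for _ in range(10):
    # Assignment step: attach each point to its nearest centroid (L2 distance)
    distances = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving, so the clusters are stable
    centroids = new_centroids

print(centroids.round(1))  # one centroid near (0, 0), the other near (10, 10)
```

On this toy data the loop converges after a couple of iterations, which is exactly the back-and-forth the slider animates.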

      Let's see what happens when we try the same with our "DEBT" and "HOUSES" data.

[18]:
VimeoVideo("713919177", h="102616b1c3", width=600)
      **Task 6.2.6:** Run the cell below to display the `SCFClusterWidget`.


      [19]:
      scfc = SCFClusterWidget(x=df["DEBT"], y=df["HOUSES"], n_clusters=3)
      scfc.show()
      VBox(children=(IntSlider(value=0, continuous_update=False, description='Step:', max=10), Output(layout=Layout(…
## Iterate

      Now that you've had a chance to play around with the process a little bit, let's get into how to build a model that does the same thing.

[20]:
VimeoVideo("713919157", h="0b2c3c95f2", width=600)
      **Task 6.2.7:** Build a `KMeans` model, assign it to the variable name `model`, and fit it to the training data `X`. 


      • What's k-means clustering?
      • Fit a model to training data in scikit-learn.
<div class="alert alert-info" role="alert">
Tip: The k-means clustering algorithm relies on random processes, so don't forget to set a random_state for all your models in this lesson.
</div>
      [35]:
      # Build model
      model = KMeans(n_clusters=3, random_state=42)
      # Fit model to data
      model.fit(X)
      [35]:
      KMeans(n_clusters=3, random_state=42)
And there it is: 4,623 datapoints spread across three clusters. Let's grab the labels that the model has assigned to the data points so we can start making a new visualization.

[21]:
VimeoVideo("713919137", h="7eafe805ff", width=600)
      **Task 6.2.8:** Extract the labels that your `model` created during training and assign them to the variable `labels`.


      • Access an object in a pipeline in scikit-learn.
      [36]:
      labels = model.labels_
      labels[:10]
      [36]:
      array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
Using the labels we just extracted, let's recreate the scatter plot from before; this time, we'll color each point according to the cluster to which the model assigned it.

[27]:
VimeoVideo("713919104", h="2f6d4285f1", width=600)
      **Task 6.2.9:** Recreate the "Home Value vs. Household Debt" scatter plot you made above, but with two changes. First, use seaborn to create the plot. Second, pass your `labels` to the `hue` argument, and set the `palette` argument to `"deep"`. 


      • What's a scatter plot?
      • Create a scatter plot using seaborn.
      [37]:
      # Plot "HOUSES" vs "DEBT" with hue=label
      sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6, hue=labels, palette="deep")
      plt.xlabel("Household Debt [$1M]")
      plt.ylabel("Home Value [$1M]")
      plt.title("Credit Fearful: Home Value vs. Household Debt");
Nice! Each cluster has its own color. The centroids are still missing, so let's pull those out.

[30]:
VimeoVideo("713919087", h="9b8635c9a8", width=600)
      **Task 6.2.10:** Extract the centroids that your `model` created during training, and assign them to the variable `centroids`. 


      • What's a centroid?
      [38]:
      centroids = model.cluster_centers_
      centroids
      [38]:
      array([[  116150.29328698,    91017.57766674],
             [34484000.        , 18384100.        ],
             [11666666.66666667,  5065800.        ]])
Let's add the centroids to the graph.

[32]:
VimeoVideo("713919002", h="08cba14f6b", width=600)
      **Task 6.2.11:** Recreate the seaborn "Home Value vs. Household Debt" scatter plot you just made, but with one difference: Add the `centroids` to the plot. Be sure to set the centroids color to `"gray"`.


      • What's a scatter plot?
      • Create a scatter plot using seaborn.
      [40]:
      # Plot "HOUSES" vs "DEBT", add centroids
      sns.scatterplot(x=df["DEBT"]/1e6, y=df["HOUSES"]/1e6, hue=labels, palette="deep")
plt.scatter(
    x=centroids[:, 1] / 1e6,
    y=centroids[:, 0] / 1e6,
    color="gray",
    marker="*",
    s=150,
)
      plt.xlabel("Household Debt [$1M]")
      plt.ylabel("Home Value [$1M]")
      plt.title("Credit Fearful: Home Value vs. Household Debt");
That looks great, but let's not pat ourselves on the back just yet. Our graph makes it *look* like the clusters are correctly assigned, but, as data scientists, we need a numerical evaluation. The data we're using is pretty clear-cut, but if things were a little more muddled, we'd want to run some calculations to make sure we got everything right.

There are two metrics that we'll use to evaluate our clusters. We'll start with inertia, which measures how spread out the points within each cluster are: the sum of squared distances from each point to its cluster centroid.

[41]:
VimeoVideo("713918749", h="bfc741b1e7", width=600)
<div class="alert alert-info" role="alert">

Question: What do those double bars in the equation mean?

Answer: It's the L2 norm, that is, the non-negative Euclidean distance between each datapoint and its centroid. In Python, it would be something like sqrt((x1 - c1)**2 + (x2 - c2)**2).

Many thanks to Aghogho Esuoma Monorien for his comment in the forum! 🙏
</div>

**Task 6.2.12:** Extract the inertia for your `model` and assign it to the variable `inertia`.

      • What's inertia?
      • Access an object in a pipeline in scikit-learn.
      • Calculate the inertia for a model in scikit-learn.
      [42]:
      inertia = model.inertia_
      print("Inertia (3 clusters):", inertia)
      Inertia (3 clusters): 939554010797059.4
      
The "best" inertia is 0, and our score is pretty far from that. Does that mean our model is "bad?" Not necessarily. Inertia is a measurement of distance (like mean absolute error from Project 2), except that it sums the *squared* distances from each point to its centroid. This means that the unit of measurement for inertia depends on the unit of measurement of our x- and y-axes. And since `"DEBT"` and `"HOUSES"` are measured in tens of millions of dollars, it's not surprising that inertia is so large.
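To make the definition concrete, here's a small sketch (on made-up toy data, not the SCF data) that recomputes `model.inertia_` by hand as the sum of squared distances from each point to its assigned centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four toy points forming two obvious pairs
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
model_toy = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)

# Inertia = sum of squared L2 distances from each point to its cluster centroid
manual_inertia = sum(
    np.sum((x - model_toy.cluster_centers_[label]) ** 2)
    for x, label in zip(X_toy, model_toy.labels_)
)
print(model_toy.inertia_, manual_inertia)  # both 4.0: each point sits 1 unit from its centroid
```

Because each toy point lies exactly 1 unit from its centroid, the four squared distances sum to 4, which matches the attribute scikit-learn reports.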

However, it would be helpful to have a metric that's easier to interpret, and that's where silhouette score comes in. Silhouette score measures the distance between different clusters. It ranges from -1 (the worst) to 1 (the best), so it's easier to interpret than inertia.
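As a quick sketch of how to read this metric (again on made-up toy data, not the SCF data), compare the silhouette score for a sensible number of clusters against a deliberately bad one:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data with exactly two tight, well-separated blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(10, 0.5, (30, 2))])

scores = {}
for k in (2, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Two clusters match the true structure, so k=2 scores much higher than k=5
print({k: round(s, 2) for k, s in scores.items()})
```

When the clusters match the real structure of the data, the score sits near 1; forcing extra clusters splits tight groups apart and drags the score down.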

[43]:
VimeoVideo("713918501", h="0462c4784a", width=600)
      **Task 6.2.13:** Calculate the silhouette score for your model and assign it to the variable `ss`.


      • What's silhouette score?
      • Calculate the silhouette score for a model in scikit-learn.
      [45]:
      ss = silhouette_score(X, model.labels_)
      print("Silhouette Score (3 clusters):", ss)
      Silhouette Score (3 clusters): 0.9768842462944348
      
Outstanding! 0.976 is pretty close to 1, so our model has done a good job at identifying 3 clusters that are far away from each other.

It's important to remember that these performance metrics are the result of the number of clusters we told our model to create. In unsupervised learning, the number of clusters is a hyperparameter that you set before training your model. So what would happen if we changed the number of clusters? Would it lead to better performance? Let's try!

[46]:
VimeoVideo("713918420", h="e16f3735c7", width=600)
      **Task 6.2.14:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.


• Write a for loop in Python.
      • Calculate the inertia for a model in scikit-learn.
      • Calculate the silhouette score for a model in scikit-learn.
      [49]:
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertia_errors.append(model.inertia_)
    silhouette_scores.append(silhouette_score(X, model.labels_))

print("Inertia:", inertia_errors)
print()
print("Silhouette Scores:", silhouette_scores)
      Inertia: [3018038313336857.5, 939554010797059.4, 546098841715646.25, 309310386410913.3, 235243397481784.3, 182225729179703.53, 150670779013790.4, 114321995931021.89, 100340259483919.02, 86229997033602.88, 74757234072100.36]
      
      Silhouette Scores: [0.9855099957519555, 0.9768842462944348, 0.9490311483406091, 0.839330043242819, 0.7287406719898627, 0.726989114305748, 0.7263840026889208, 0.7335125606476427, 0.692157992955073, 0.6949309528556856, 0.6951831031001252]
      
Now that we have both performance metrics for several different settings of `n_clusters`, let's make some line plots to see the relationship between the number of clusters in a model and its inertia and silhouette scores.

[47]:
VimeoVideo("713918224", h="32ff34ffa1", width=600)
      **Task 6.2.15:** Create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.


      • Create a line plot in Matplotlib.
      [50]:
# Plot `inertia_errors` by `n_clusters`
plt.plot(n_clusters, inertia_errors)
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("K-Means Model: Inertia vs Number of Clusters");
What we're seeing here is that, as the number of clusters increases, inertia goes down. In fact, we could get inertia to 0 if we told our model to make 4,623 clusters (the same number of observations in `X`), but those clusters wouldn't be helpful to us.

      The trick with choosing the right number of clusters is to look for the "bend in the elbow" for this plot. In other words, we want to pick the point where the drop in inertia becomes less dramatic and the line begins to flatten out. In this case, it looks like the sweet spot is 4 or 5.
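If you'd rather not eyeball the bend, one rough heuristic is to look at the fractional drop in inertia gained by each additional cluster and see where it levels off. The values below are the inertia numbers from our loop output, rounded for readability:

```python
import numpy as np

# Rounded inertia values from the loop above, for n_clusters = 2 through 7
inertia = np.array([3.02e15, 9.40e14, 5.46e14, 3.09e14, 2.35e14, 1.82e14])

# Fractional improvement gained by adding one more cluster
drops = -np.diff(inertia) / inertia[:-1]
print(drops.round(2))
```

The early steps cut inertia by large fractions, while the later steps buy much less, which is the "bend" we read off the plot.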

      Let's see what the silhouette score looks like.

[51]:
VimeoVideo("713918153", h="3f3a1312d2", width=600)
      **Task 6.2.16:** Create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.


      • Create a line plot in Matplotlib.
      [52]:
# Plot `silhouette_scores` vs `n_clusters`
plt.plot(n_clusters, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("K-Means Model: Silhouette Score vs Number of Clusters");
Note that, in contrast to our inertia plot, bigger is better. So we're not looking for a "bend in the elbow" but rather a number of clusters for which the silhouette score still remains high. We can see that silhouette score drops drastically beyond 4 clusters. Given this and what we saw in the inertia plot, it looks like the optimal number of clusters is 4.

      Now that we've decided on the final number of clusters, let's build a final model.

[53]:
VimeoVideo("713918108", h="e6aa88569e", width=600)
      **Task 6.2.17:** Build and train a new k-means model named `final_model`. Use the information you gained from the two plots above to set an appropriate value for the `n_clusters` argument. Once you've built and trained your model, submit it to the grader for evaluation. 


      • Fit a model to training data in scikit-learn.
      [54]:
      # Build model
      final_model = KMeans(n_clusters=4, random_state=42)
      # Fit model to data
      final_model.fit(X)
      [54]:
      KMeans(n_clusters=4, random_state=42)
      [55]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.2.17", final_model)

      Excellent work.

      Score: 1

(In case you're wondering, we don't need an *Evaluate* section in this notebook because we don't have any test data to evaluate our model with.)

# Communicate

      [56]:
VimeoVideo("713918073", h="3929b58011", width=600)

**Task 6.2.18:** Create one last "Home Value vs. Household Debt" scatter plot that shows the clusters that your `final_model` has assigned to the training data.

      • What's a scatter plot?
      • Create a scatter plot using Matplotlib.
      [61]:
# Plot "HOUSES" vs "DEBT" with `final_model` labels
sns.scatterplot(
    x=df["DEBT"] / 1e6,
    y=df["HOUSES"] / 1e6,
    hue=final_model.labels_,
)
      plt.xlabel("Household Debt [$1M]")
      plt.ylabel("Home Value [$1M]")
      plt.title("Credit Fearful: Home Value vs. Household Debt");
Nice! You can see all four of our clusters, each differentiated from the rest by color.

      We're going to make one more visualization, converting the cluster analysis we just did to something a little more actionable: a side-by-side bar chart. In order to do that, we need to put our clustered data into a DataFrame.

      [57]:
VimeoVideo("713918023", h="110156bd98", width=600)

**Task 6.2.19:** Create a DataFrame `xgb` that contains the mean `"DEBT"` and `"HOUSES"` values for each of the clusters in your `final_model`.

      • Access an object in a pipeline in scikit-learn.
      • Aggregate data using the groupby method in pandas.
      • Create a DataFrame from a Series in pandas.
      [66]:
      xgb = X.groupby(final_model.labels_).mean()
      xgb
      [66]:
      HOUSES DEBT
      0 1.031872e+05 8.488629e+04
      1 3.448400e+07 1.838410e+07
      2 1.407400e+07 5.472800e+06
      3 4.551429e+06 2.420929e+06
      [67]:
      final_model.cluster_centers_
      [67]:
      array([[  103187.22476563,    84886.28951384],
             [34484000.        , 18384100.        ],
             [14074000.        ,  5472800.        ],
             [ 4551428.57142857,  2420928.57142857]])
Before you move to the next task, print out the `cluster_centers_` for your `final_model`. Do you see any similarities between them and the DataFrame you just made? Why do you think that is?
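The resemblance is no accident. When k-means converges, each centroid in `cluster_centers_` is exactly the mean of the observations assigned to that cluster, so — because this model was fit without any scaling step — grouping the features by the model's labels and averaging reproduces the centroids. Here's a small sketch on hypothetical data (the numbers are made up, not the SCF):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical two-feature data with two obvious groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DEBT": np.concatenate([rng.normal(1e4, 1e3, 50), rng.normal(9e4, 1e3, 50)]),
    "HOUSES": np.concatenate([rng.normal(5e4, 1e3, 50), rng.normal(4e5, 1e3, 50)]),
})

model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy)

# Grouping by the assigned labels and averaging reproduces the centroids
group_means = toy.groupby(model.labels_).mean()
print(np.allclose(model.cluster_centers_, group_means.to_numpy()))
```

Note that this exact match only holds when the model is fit on the raw features; once a scaler is added to the pipeline, `cluster_centers_` live in the scaled space instead.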

      [62]:
VimeoVideo("713917740", h="bcc496c2d9", width=600)

**Task 6.2.20:** Create a side-by-side bar chart from `xgb` that shows the mean `"DEBT"` and `"HOUSES"` values for each of the clusters in your `final_model`. For readability, you'll want to divide the values in `xgb` by 1 million. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$1 million]"`, and use the title `"Mean Home Value & Household Debt by Cluster"`.

      • Create a bar chart using pandas.
      [69]:
      # Create side-by-side bar chart of `xgb`
      xgb.plot(kind="bar")
      plt.xlabel("Cluster")
      plt.ylabel("Value [$1 million]")
      plt.title("Mean Home Value & Household Debt by Cluster");
      [71]:
# Check the ratio of debt to home value for each cluster
(xgb["DEBT"] / xgb["HOUSES"]).plot(kind="bar")
      [71]:
      <AxesSubplot:>
In this plot, we have our four clusters spread across the x-axis, and the dollar amounts for home value and household debt on the y-axis.

The first thing to look at in this chart is the different mean home values for the four clusters. Cluster 0 represents households with small to moderate home values, clusters 2 and 3 have high home values, and cluster 1 has extremely high values.

      The second thing to look at is the proportion of debt to home value. In clusters 1 and 3, this proportion is around 0.5. This suggests that these groups have a moderate amount of untapped equity in their homes. But for group 0, it's almost 1, which suggests that the largest source of household debt is their mortgage. Group 2 is unique in that they have the smallest proportion of debt to home value, around 0.4.

This information could be useful to financial institutions that want to target customers with products that would appeal to them. For instance, households in group 0 might be interested in refinancing their mortgage to lower their interest rate. Group 2 households could be interested in a home equity line of credit because they have more equity in their homes. And the bankers, Bill Gates, and Beyoncés in group 1 might want white-glove personalized wealth management.

      ---

      Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


      Usage Guidelines

      This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

      This means:

      • ⓧ No downloading this notebook.
      • ⓧ No re-sharing of this notebook with friends or colleagues.
      • ⓧ No downloading the embedded videos in this notebook.
      • ⓧ No re-sharing embedded videos with friends or colleagues.
      • ⓧ No adding this notebook to public or private repositories.
      • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>6.3. Clustering with Multiple Features</strong></font>

In the previous lesson, we built a K-Means model to create clusters of respondents to the Survey of Consumer Finances. We made our clusters by looking at two features only, but there are hundreds of features in the dataset that we didn't take into account and that could contain valuable information. In this lesson, we'll examine all the features, selecting five to create clusters with. After we build our model and choose an appropriate number of clusters, we'll learn how to visualize multi-dimensional clusters in a 2D scatter plot using something called principal component analysis (PCA).

      [1]:
      import pandas as pd
      import plotly.express as px
      import wqet_grader
      from IPython.display import VimeoVideo
      from scipy.stats.mstats import trimmed_var
      from sklearn.cluster import KMeans
      from sklearn.decomposition import PCA
      from sklearn.metrics import silhouette_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

wqet_grader.init("Project 6 Assessment")
      [3]:
VimeoVideo("714612789", h="f4f8c10683", width=600)
# Prepare Data

## Import

We spent some time in the last lesson zooming in on a useful subset of the SCF, and this time, we're going to zoom in even further. One of the persistent issues we've had with this dataset is that it includes some outliers in the form of ultra-wealthy households. This didn't make much of a difference for our last analysis, but it could pose a problem in this lesson, so we're going to focus on families with net worth under \\$2 million.

      [4]:
VimeoVideo("714612746", h="07dc57f72c", width=600)

**Task 6.3.1:** Rewrite your `wrangle` function from the last lesson so that it returns a DataFrame of households whose net worth is less than \\$2 million and that have been turned down for credit or feared being denied credit in the past 5 years (see `"TURNFEAR"`).

      • Write a function in Python.
      • Subset a DataFrame by selecting one or more columns in pandas.
      [2]:
def wrangle(filepath):
    df = pd.read_csv(filepath)
    # Keep credit-fearful households with net worth under $2 million
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
    df = df[mask]
    return df
      [3]:
      df = wrangle("data/SCFP2019.csv.gz")
      print(df.shape)
      df.head()
      (4418, 351)
      
      [3]:
      YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
      5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
      7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

      5 rows × 351 columns

## Explore

In this lesson, we want to make clusters using more than two features, but which of the 351 features should we choose? Oftentimes, this decision will be made for you. For example, a stakeholder could give you a list of the features that are most important to them. If you don't have that limitation, though, another way to choose the best features for clustering is to determine which numerical features have the largest **variance**. That's what we'll do here.
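As a quick illustration of this selection strategy (toy data with hypothetical column names, not the SCF), ranking columns by variance and keeping the largest is a one-liner in pandas:

```python
import pandas as pd

# Hypothetical data: "flat" barely varies, "spread" varies a lot
toy = pd.DataFrame({
    "flat": [1.0, 1.1, 0.9, 1.0],
    "medium": [10.0, 20.0, 15.0, 5.0],
    "spread": [0.0, 1000.0, -1000.0, 500.0],
})

# Rank features by variance and keep the top two, mirroring the approach above
top_two = toy.var().sort_values().tail(2)
print(top_two.index.tolist())  # ['medium', 'spread']
```

The intuition: a feature that hardly varies can't distinguish one household from another, so it contributes little to any clustering.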

      [7]:
VimeoVideo("714612679", h="040facf6e2", width=600)

**Task 6.3.2:** Calculate the variance for all the features in `df`, and create a Series `top_ten_var` with the 10 features with the largest variance.

      • What's variance?
      • Calculate the variance of a DataFrame or Series in pandas.
      [8]:
      # Calculate variance, get 10 largest features
      top_ten_var = df.var().sort_values().tail(10)
      top_ten_var
      [8]:
      PLOAN1      1.140894e+10
      ACTBUS      1.251892e+10
      BUS         1.256643e+10
      KGTOTAL     1.346475e+10
      DEBT        1.848252e+10
      NHNFIN      2.254163e+10
      HOUSES      2.388459e+10
      NETWORTH    4.847029e+10
      NFIN        5.713939e+10
      ASSET       8.303967e+10
      dtype: float64
As usual, it's harder to make sense of a list like this than it would be if we visualized it, so let's make a graph.

      [9]:
VimeoVideo("714612647", h="5ecf36a0db", width=600)

**Task 6.3.3:** Use plotly express to create a horizontal bar chart of `top_ten_var`. Be sure to label your x-axis `"Variance"`, the y-axis `"Feature"`, and use the title `"SCF: High Variance Features"`.

      • What's a bar chart?
      • Create a bar chart using plotly express.
      [10]:
# Create horizontal bar chart of `top_ten_var`
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="SCF: High Variance Features",
)
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")

fig.show()
One thing that we've seen throughout this project is that many of the wealth indicators are highly skewed, with a few outlier households having enormous wealth. Those outliers can affect our measure of variance. Let's see if that's the case with one of the features from `top_ten_var`.

      [11]:
VimeoVideo("714612615", h="9ae23890fc", width=600)

**Task 6.3.4:** Use plotly express to create a horizontal boxplot of `"NHNFIN"` to determine if the values are skewed. Be sure to label the x-axis `"Value [$]"`, and use the title `"Distribution of Non-home, Non-Financial Assets"`.

      • What's a boxplot?
      • Create a boxplot using plotly express.
      [42]:
# Create a boxplot of `NHNFIN`
fig = px.box(
    data_frame=df,
    x="NHNFIN",
    title="Distribution of Non-home, Non-Financial Assets",
)
fig.update_layout(xaxis_title="Value [$]")

fig.show()
Whoa! The dataset is massively right-skewed because of the huge outliers on the right side of the distribution. Even though we already excluded households with a high net worth with our `wrangle` function, the variance is still being distorted by some extreme outliers.

The best way to deal with this is to look at the trimmed variance, where we remove extreme values before calculating the variance. We can do this using the `trimmed_var` function from the SciPy library.
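Here's a sketch of the idea on made-up numbers: a single extreme outlier inflates the ordinary variance by orders of magnitude, while the trimmed variance — which drops the top and bottom 10% of observations before computing anything — stays close to the variance of the typical values.

```python
import numpy as np
from scipy.stats.mstats import trimmed_var

# Hypothetical sample: mostly modest values plus one extreme outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1_000_000.0])

full_var = values.var(ddof=1)
# Drop the top and bottom 10% of observations before computing variance
trim_var = trimmed_var(values, limits=(0.1, 0.1))

print(full_var, trim_var)  # full variance is enormous; trimmed variance is tiny
```

With ten observations, `limits=(0.1, 0.1)` removes one value from each end, so the outlier never enters the calculation.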

      [13]:
VimeoVideo("714612570", h="b1be8fb750", width=600)

**Task 6.3.5:** Calculate the trimmed variance for the features in `df`. Your calculations should not include the top and bottom 10% of observations. Then create a Series `top_ten_trim_var` with the 10 features with the largest variance.

      • What's trimmed variance?
      • Calculate the trimmed variance of data using SciPy.
      • Apply a function to a DataFrame in pandas.
      [7]:
# Calculate trimmed variance
top_ten_trim_var = df.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
      top_ten_trim_var
      [7]:
      WAGEINC     5.550737e+08
      HOMEEQ      7.338377e+08
      NH_MORT     1.333125e+09
      MRTHEL      1.380468e+09
      PLOAN1      1.441968e+09
      DEBT        3.089865e+09
      NETWORTH    3.099929e+09
      HOUSES      4.978660e+09
      NFIN        8.456442e+09
      ASSET       1.175370e+10
      dtype: float64
Okay! Now that we've got a better set of numbers, let's make another bar graph.

      [33]:
VimeoVideo("714611188", h="d762a98b1e", width=600)

**Task 6.3.6:** Use plotly express to create a horizontal bar chart of `top_ten_trim_var`. Be sure to label your x-axis `"Trimmed Variance"`, the y-axis `"Feature"`, and use the title `"SCF: High Variance Features"`.

      • What's a bar chart?
      • Create a bar chart using plotly express.
      [34]:
# Create horizontal bar chart of `top_ten_trim_var`
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="SCF: High Variance Features",
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")

fig.show()
There are three things to notice in this plot. First, the variances have decreased a lot. In our previous chart, the x-axis went up to \\$80 billion; this one goes up to \\$12 billion. Second, the top 10 features have changed a bit. All the features relating to business ownership (`"...BUS"`) are gone. Finally, we can see that there are big differences in variance from feature to feature. For example, the variance for `"WAGEINC"` is around \\$500 million, while the variance for `"ASSET"` is nearly \\$12 billion. In other words, these features have completely different scales. This is something that we'll need to address before we can make good clusters.

      [43]:
VimeoVideo("714611161", h="61dee490ee", width=600)

**Task 6.3.7:** Generate a list `high_var_cols` with the column names of the five features with the highest trimmed variance.

      • What's an index?
      • Access the index of a DataFrame or Series in pandas.
      [8]:
      high_var_cols = top_ten_trim_var.tail(5).index.tolist()
      high_var_cols
      [8]:
      ['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']
## Split

Now that we've gotten our data to a place where we can use it, we can follow the steps we've used before to build a model, starting with a feature matrix.

      [54]:
VimeoVideo("714611148", h="f7fbd4bcc5", width=600)

**Task 6.3.8:** Create the feature matrix `X`. It should contain the five columns in `high_var_cols`.

      • What's a feature matrix?
      • Subset a DataFrame by selecting one or more columns in pandas.
      [9]:
      X = df[high_var_cols]
      print("X shape:", X.shape)
      X.head()
      X shape: (4418, 5)
      
      [9]:
      DEBT NETWORTH HOUSES NFIN ASSET
      5 12200.0 -6710.0 0.0 3900.0 5490.0
      6 12600.0 -4710.0 0.0 6300.0 7890.0
      7 15300.0 -8115.0 0.0 5600.0 7185.0
      8 14100.0 -2510.0 0.0 10000.0 11590.0
      9 15400.0 -5715.0 0.0 8100.0 9685.0
# Build Model

## Iterate

During our EDA, we saw that we had a scale issue among our features. That issue can make it harder to cluster the data, so we'll need to fix that to help our analysis along. One strategy we can use is **standardization**, a statistical method for putting all the variables in a dataset on the same scale. Let's explore how that works here. Later, we'll incorporate it into our model pipeline.
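The arithmetic behind standardization is simple: each value x in a column becomes z = (x − mean) / std, computed per column. A minimal sketch on toy data (the column names are hypothetical) shows that `StandardScaler` matches this formula, using the population standard deviation (`ddof=0`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data on very different scales
toy = pd.DataFrame({"small": [1.0, 2.0, 3.0, 4.0], "big": [1e4, 2e4, 3e4, 4e4]})

# StandardScaler computes z = (x - mean) / std, with the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(scaled, manual.to_numpy()))  # both columns now share one scale
```

After this transformation, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculations that k-means relies on.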

      [56]:
VimeoVideo("714611113", h="3671a603b5", width=600)

**Task 6.3.9:** Create a DataFrame `X_summary` with the mean and standard deviation for all the features in `X`.

      • Aggregate data in a DataFrame using one or more functions in pandas.
      [63]:
      X_summary = X.aggregate(["mean","std"])
      X_summary
      [63]:
      DEBT NETWORTH HOUSES NFIN ASSET
      mean 72701.258488 76387.768900 74530.805794 117330.637166 149089.027388
      std 135950.435529 220159.684405 154546.415791 239038.471726 288166.040553
That's the information we need to standardize our data, so let's make it happen.

      [61]:
VimeoVideo("714611056", h="670f6bdb78", width=600)

**Task 6.3.10:** Create a `StandardScaler` transformer, use it to transform the data in `X`, and then put the transformed data into a DataFrame named `X_scaled`.

      • What's standardization?
• Transform data using a transformer in scikit-learn.
      [65]:
# Instantiate transformer
ss = StandardScaler()

# Transform `X`
X_scaled_data = ss.fit_transform(X)

# Put `X_scaled_data` into DataFrame
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)

      print("X_scaled shape:", X_scaled.shape)
      X_scaled.head()
      X_scaled shape: (4418, 5)
      
      [65]:
      DEBT NETWORTH HOUSES NFIN ASSET
      0 -0.445075 -0.377486 -0.48231 -0.474583 -0.498377
      1 -0.442132 -0.368401 -0.48231 -0.464541 -0.490047
      2 -0.422270 -0.383868 -0.48231 -0.467470 -0.492494
      3 -0.431097 -0.358407 -0.48231 -0.449061 -0.477206
      4 -0.421534 -0.372966 -0.48231 -0.457010 -0.483818
As you can see, all five of the features use the same scale now. But just to make sure, let's take a look at their mean and standard deviation.

      [66]:
VimeoVideo("714611032", h="1ed03c46eb", width=600)

**Task 6.3.11:** Create a DataFrame `X_scaled_summary` with the mean and standard deviation for all the features in `X_scaled`.

      • Aggregate data in a DataFrame using one or more functions in pandas.
      [68]:
      X_scaled_summary = X_scaled.aggregate(["mean", "std"]).astype(int)
      X_scaled_summary
      [68]:
      DEBT NETWORTH HOUSES NFIN ASSET
      mean 0 0 0 0 0
      std 1 1 1 1 1
And that's what it should look like. Remember, standardization takes all the features and scales them so that they all have a mean of 0 and a standard deviation of 1.

Now that we can compare all our data on the same scale, we can start making clusters. Just like we did last time, we need to figure out how many clusters we should have.

      [69]:
VimeoVideo("714610976", h="82f32af967", width=600)

**Task 6.3.12:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Your model should include a `StandardScaler`. Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.

      • Write a for loop in Python.
      • Calculate the inertia for a model in scikit-learn.
      • Calculate the silhouette score for a model in scikit-learn.
      • Create a pipeline in scikit-learn.
      [73]:
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, random_state=42),
    )
    model.fit(X)
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    silhouette_scores.append(
        silhouette_score(X, model.named_steps["kmeans"].labels_)
    )

      print("Inertia:", inertia_errors[:3])
      print()
      print("Silhouette Scores:", silhouette_scores[:3])
      Inertia: [11028.058082607145, 7190.526303575355, 5924.997726868041]
      
      Silhouette Scores: [0.7464502937083215, 0.7044601307791996, 0.6962653079183132]
      
Just like last time, let's create an elbow plot to see how many clusters we should use.

      [71]:
VimeoVideo("714610940", h="bacf42a282", width=600)

**Task 6.3.13:** Use plotly express to create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.

      • What's a line plot?
      • Create a line plot in plotly express.
      [81]:
# Create line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=n_clusters,
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters",
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")

fig.show()
You can see that the line starts to flatten out around 4 or 5 clusters.

<div class="alert alert-block alert-info">

Note: We ended up using 4 clusters last time, too, but that's because we're working with very similar data. 4 clusters isn't always going to be the right choice for this type of analysis, so always check the plots rather than reusing an old value.

</div>

Let's make another line plot based on the silhouette scores.

      [72]:
VimeoVideo("714610912", h="01961ee57a", width=600)

**Task 6.3.14:** Use plotly express to create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.

      • What's a line plot?
      • Create a line plot in plotly express.
      [82]:
# Create a line plot of `silhouette_scores` vs `n_clusters`
fig = px.line(
    x=n_clusters, y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Silhouette Score")
fig.show()
This one's a little less straightforward, but we can see that the best silhouette scores occur when there are 3 or 4 clusters.
      Putting the information from this plot together with our inertia plot, it seems like the best setting for n_clusters will be 4.
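One way to make that first cut reproducible is to pick the k with the highest silhouette score programmatically. A minimal sketch with made-up scores (hypothetical values, not the ones computed above):

```python
# Hypothetical silhouette scores for k = 2..7 (not the lesson's actual
# results). Higher silhouette = denser, better-separated clusters.
cluster_counts = list(range(2, 8))
scores = [0.61, 0.72, 0.70, 0.55, 0.50, 0.48]

# Pick the k whose silhouette score is largest
best_k = max(zip(scores, cluster_counts))[1]
print("Best k by silhouette:", best_k)
```

Treat this as a starting point rather than a rule: in this lesson we weigh the silhouette plot against the inertia plot and settle on 4.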

[83]:
VimeoVideo("714610883", h="a6a0431b02", width=600)
      **Task 6.3.15:** Build and train a new k-means model named `final_model`. Use the information you gained from the two plots above to set an appropriate value for the `n_clusters` argument. Once you've built and trained your model, submit it to the grader for evaluation.


      • Create a pipeline in scikit-learn.
      • Fit a model to training data in scikit-learn.
      [10]:
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42)
)
final_model.fit(X)
[10]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4, random_state=42))])
      When you're confident in your model, submit it to the grader.


      [87]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.3.14", final_model)

      Python master 😁

      Score: 1

# 3. Communicate

It's time to let everyone know how things turned out. Let's start by grabbing the labels.

[88]:
VimeoVideo("714610862", h="69ff3fb2c8", width=600)
      **Task 6.3.16:** Extract the labels that your `final_model` created during training and assign them to the variable `labels`.


      • Access an object in a pipeline in scikit-learn.
[11]:
labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])
[0 0 0 0 0]
      We're going to make a visualization, so we need to create a new DataFrame to work with.


[90]:
VimeoVideo("714610842", h="008a463aca", width=600)
      **Task 6.3.17:** Create a DataFrame `xgb` that contains the mean values of the features in `X` for each of the clusters in your `final_model`.


      • Access an object in a pipeline in scikit-learn.
      • Aggregate data using the groupby method in pandas.
      • Create a DataFrame from a Series in pandas.
[12]:
xgb = X.groupby(labels).mean()
xgb
[12]:
            DEBT       NETWORTH         HOUSES          NFIN         ASSET
0   26551.075439   13676.153182   13745.637777  2.722605e+04  4.022723e+04
1  218112.818182  174713.441558  257403.246753  3.305884e+05  3.928263e+05
2  116160.779817  965764.155963  264339.449541  7.800611e+05  1.081925e+06
3  732937.575758  760397.575758  826136.363636  1.276227e+06  1.493335e+06
Now that we have a DataFrame, let's make a bar chart and see how our clusters differ.

[13]:
VimeoVideo("714610772", h="e118407ff1", width=600)
      **Task 6.3.18:** Use plotly express to create a side-by-side bar chart from `xgb` that shows the mean of the features in `X` for each of the clusters in your `final_model`. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$]"`, and use the title `"Mean Household Finances by Cluster"`.


      • What's a bar chart?
      • Create a bar chart using plotly express.
      [17]:
# Create side-by-side bar chart of `xgb`
fig = px.bar(
    xgb,
    barmode="group",
    title="Mean Household Finances by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
fig.show()
Remember that our clusters are based partially on `NETWORTH`, which means that the households in the 0 cluster have the smallest net worth, and the households in the 2 cluster have the highest. Based on that, there are some interesting things to unpack here.

First, take a look at the `DEBT` variable. You might think that it would scale as net worth increases, but it doesn't. The households in cluster 2 carry roughly half the debt of the households in cluster 1, even though the value of their houses (shown in green) is roughly the same. You can't really tell from this data what's going on, but one possibility might be that the people in cluster 2 have enough money to pay down their debts, but not quite enough money to leverage what they have into additional debts. The people in cluster 3, by contrast, might not need to worry about carrying debt because their net worth is so high.

      Finally, since we started out this project looking at home values, take a look at the relationship between DEBT and HOUSES. The value of the debt for the people in cluster 0 is higher than the value of their houses, suggesting that most of the debt being carried by those people is tied up in their mortgages — if they own a home at all. Contrast that with the other three clusters: the value of everyone else's debt is lower than the value of their homes.

      So all that's pretty interesting, but it's different from what we did last time, right? At this point in the last lesson, we made a scatter plot. This was a straightforward task because we only worked with two features, so we could plot the data points in two dimensions. But now X has five dimensions! How can we plot this to give stakeholders a sense of our clusters?

      Since we're working with a computer screen, we don't have much of a choice about the number of dimensions we can use: it's got to be two. So, if we're going to do anything like the scatter plot we made before, we'll need to take our 5-dimensional data and change it into something we can look at in 2 dimensions.

[18]:
VimeoVideo("714610665", h="19c9f7bf7f", width=600)
      **Task 6.3.19:** Create a `PCA` transformer, use it to reduce the dimensionality of the data in `X` to 2, and then put the transformed data into a DataFrame named `X_pca`. The columns of `X_pca` should be named `"PC1"` and `"PC2"`.


      • What's principal component analysis (PCA)?
      • Transform data using a transformer in scikit-learn.
[22]:
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform `X`
X_t = pca.fit_transform(X)

# Put `X_t` into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])

print("X_pca shape:", X_pca.shape)
X_pca.head()
X_pca shape: (4418, 2)
[22]:
             PC1           PC2
0 -221525.424530 -22052.273003
1 -217775.100722 -22851.358068
2 -219519.642175 -19023.646333
3 -212195.720367 -22957.107039
4 -215540.507551 -20259.749306
So there we go: our five dimensions have been reduced to two. Let's make a scatter plot and see what we get.

[23]:
VimeoVideo("714610491", h="755c66fe15", width=600)
**Task 6.3.20:** Use plotly express to create a scatter plot of `X_pca`. Be sure to color the data points using the labels generated by your `final_model`. Label the x-axis `"PC1"`, the y-axis `"PC2"`, and use the title `"PCA Representation of Clusters"`.

      • What's a scatter plot?
      • Create a scatter plot using plotly express.
      [25]:
# Create scatter plot of `PC2` vs `PC1`
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels,
    title="PCA Representation of Clusters"
)
fig.show()
<div class="alert alert-block alert-info">
Note: You can often improve the performance of PCA by standardizing your data first. Give it a try by including a StandardScaler in your transformation of X. How does it change the clusters in your scatter plot?
</div>
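Here's one way to try the note's suggestion: chain a `StandardScaler` in front of `PCA` with a pipeline. This is a sketch on random stand-in data, not the SCF extract:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data; in the lesson, `X` would be the five SCF features
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("ABCDE"))

# Standardize first so no single large-dollar feature dominates the components
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=42))
X_t = scaled_pca.fit_transform(X)
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
print("X_pca shape:", X_pca.shape)
```

Because the SCF features are all in dollars but on very different scales, scaling first usually changes which directions PCA considers "high variance" — and therefore how the clusters look in the plot.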
One limitation of this plot is that it's hard to explain what the axes here represent. In fact, both of them are a combination of the five features we originally had in `X`, which means this is pretty abstract. Still, it's the best way we have to show as much information as possible as an explanatory tool for people outside the data science community.
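If you want a concrete number to attach to those abstract axes, you can report how much of the original variance the two components retain via `explained_variance_ratio_`. A sketch on random stand-in data (not the SCF extract):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data; in the lesson this would be the five-feature `X`
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

pca = PCA(n_components=2, random_state=42).fit(X)

# Share of the original variance each component captures, and the total
print("Per component:", pca.explained_variance_ratio_)
print("Total kept in 2-D:", pca.explained_variance_ratio_.sum())
```

Telling stakeholders something like "these two axes keep the majority of the variation in the data" makes the plot easier to trust.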

      So what does this graph mean? It means that we made four tightly-grouped clusters that share some key features. If we were presenting this to a group of stakeholders, it might be useful to show this graph first as a kind of warm-up, since most people understand how a two-dimensional object works. Then we could move on to a more nuanced analysis of the data.

      Just something to keep in mind as you continue your data science journey.

---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

      Usage Guidelines

      This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

      This means:

      • ⓧ No downloading this notebook.
      • ⓧ No re-sharing of this notebook with friends or colleagues.
      • ⓧ No downloading the embedded videos in this notebook.
      • ⓧ No re-sharing embedded videos with friends or colleagues.
      • ⓧ No adding this notebook to public or private repositories.
      • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>6.4. Interactive Dashboard</strong></font>

In the last lesson, we built a model based on the highest-variance features in our dataset and created several visualizations to communicate our results. In this lesson, we're going to combine all of these elements into a dynamic web application that will allow users to choose their own features, build a model, and evaluate its performance through a graphical user interface. In other words, you'll create a tool that will allow anyone to build a model without code.

<div class="alert alert-block alert-warning">
Warning: If you have issues with your app launching during this project, try restarting your kernel and re-running the notebook from the beginning. Go to Kernel > Restart Kernel and Clear All Outputs.

If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the "Overview" section of the WQU learning platform.
</div>
[3]:
import pandas as pd
import plotly.express as px
import wqet_grader
from dash import Input, Output, dcc, html
from IPython.display import VimeoVideo
from jupyter_dash import JupyterDash
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wqet_grader.init("Project 6 Assessment")

JupyterDash.infer_jupyter_proxy_config()
[ ]:
VimeoVideo("715724401", h="062cb7d8cb", width=600)
# 1. Prepare Data
As always, we'll start by bringing our data into the project using a `wrangle` function.

## 1.1. Import

[ ]:
VimeoVideo("715724313", h="711e785135", width=600)
      **Task 6.4.1:** Complete the `wrangle` function below, using the docstring as a guide. Then use your function to read the file `"data/SCFP2019.csv.gz"` into a DataFrame. 


[4]:
def wrangle(filepath):
    """Read SCF data file into ``DataFrame``.

    Returns only credit-fearful households whose net worth is less than $2 million.

    Parameters
    ----------
    filepath : str
        Location of CSV file.
    """
    df = pd.read_csv(filepath)
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
    df = df[mask]
    return df
[5]:
df = wrangle("data/SCFP2019.csv.gz")
print(df.shape)
df.head()
(4418, 351)
      
      [5]:
      YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
      5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
      7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
      9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

      5 rows × 351 columns

# 2. Build Dashboard

It's app time! There are lots of steps to follow here, but, by the end, you'll have made an interactive dashboard! We'll start with the layout.

## 2.1. Application Layout

First, instantiate the application.

[ ]:
VimeoVideo("715724244", h="41e32f352f", width=600)
      **Task 6.4.2:** Instantiate a `JupyterDash` application and assign it to the variable name `app`.


[6]:
app = JupyterDash(__name__)
      Then, let's give the app some labels.


[ ]:
VimeoVideo("715724173", h="21f2757631", width=600)
      **Task 6.4.3:** Start building the layout of your `app` by creating a `Div` object that has two child objects: an `H1` header that reads `"Survey of Consumer Finances"` and an `H2` header that reads `"High Variance Features"`.


<div class="alert alert-block alert-info">
Note: We're going to build the layout for our application iteratively. So be prepared to return to this block of code several times as we add features.
</div>
      [29]:
app.layout = html.Div(
    [
        # Application title
        html.H1("Survey of Consumer Finances"),
        # Bar chart title
        html.H2("High Variance Features"),
        # Bar chart element
        dcc.Graph(id="bar-chart"),
        # Radio button to toggle trimmed variance
        dcc.RadioItems(
            options=[
                {"label": "trimmed", "value": True},
                {"label": "not trimmed", "value": False}
            ],
            id="trimmed-button",
            value=True
        ),
        html.H2("K-means Clustering"),
        html.H3("Number of Clusters (k)"),
        dcc.Slider(min=2, max=12, step=1, value=2, id="k-slider"),
        html.Div(id="metrics"),
        # PCA scatter plot element
        dcc.Graph(id="pca-chart")
    ]
)
Eventually, the app we make will have several interactive parts. We'll start with a bar chart.

## 2.2. Variance Bar Chart

No matter how well-designed the chart might be, it won't show up in the app unless we add it to the dashboard as an object first.

[ ]:
VimeoVideo("715724086", h="e9ed963958", width=600)
      **Task 6.4.4:** Add a `Graph` object to your application's layout. Be sure to give it the id `"bar-chart"`.


Just like we did last time, we need to retrieve the features with the highest variance.

[ ]:
VimeoVideo("715724816", h="80ec24d3d6", width=600)
      **Task 6.4.5:** Create a `get_high_var_features` function that returns the five highest-variance features in a DataFrame. Use the docstring for guidance. 


[8]:
def get_high_var_features(trimmed=True, return_feat_names=False):
    """Returns the five highest-variance features of ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    return_feat_names : bool, default=False
        If ``True``, returns feature names as a ``list``. If ``False``,
        returns a ``Series`` whose index is feature names and whose values
        are variances.
    """
    # Calculate variance, trimmed or untrimmed
    if trimmed:
        top_five_features = df.apply(trimmed_var).sort_values().tail(5)
    else:
        top_five_features = df.var().sort_values().tail(5)
    # Extract feature names if requested
    if return_feat_names:
        top_five_features = top_five_features.index.tolist()
    return top_five_features
[ ]:
get_high_var_features()
      Now that we have our top five features, we can use a function to return them in a bar chart.


[ ]:
VimeoVideo("715724735", h="5238a5c518", width=600)
      **Task 6.4.6:** Create a `serve_bar_chart` function that returns a plotly express bar chart of the five highest-variance features. You should use `get_high_var_features` as a helper function. Follow the docstring for guidance.


[9]:
@app.callback(
    Output("bar-chart", "figure"), Input("trimmed-button", "value")
)
def serve_bar_chart(trimmed=True):
    """Returns a horizontal bar chart of five highest-variance features.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    """
    # Get five highest-variance features
    top_five_features = get_high_var_features(trimmed=trimmed, return_feat_names=False)
    # Build horizontal bar chart
    fig = px.bar(x=top_five_features, y=top_five_features.index, orientation="h")
    fig.update_layout(xaxis_title="Variance", yaxis_title="Features")
    return fig
[ ]:
serve_bar_chart()
      Now, add the actual chart to the app.


[ ]:
VimeoVideo("715724706", h="b672dd9202", width=600)
**Task 6.4.7:** Use your `serve_bar_chart` function to add a bar chart to `"bar-chart"`.

What we've done so far hasn't been all that different from other visualizations we've built in the past. Most of those charts have been static, but this one's going to be interactive. Let's add a radio button to give people something to play with.

[ ]:
VimeoVideo("715724662", h="957a128506", width=600)
**Task 6.4.8:** Add a radio button to your application's layout. It should have two options: `"trimmed"` (which carries the value `True`) and `"not trimmed"` (which carries the value `False`). Be sure to give it the id `"trimmed-button"`.


Now that we have code to create our bar chart, a place in our app to put it, and a button to manipulate it, let's connect all three elements.

[ ]:
VimeoVideo("715724573", h="7de7932f70", width=600)
**Task 6.4.9:** Add a callback decorator to your `serve_bar_chart` function. The callback input should be the value returned by `"trimmed-button"`, and the output should be directed to `"bar-chart"`.


When you're satisfied with your bar chart and radio buttons, scroll down to the bottom of this page and run the last block of code to see your work in action!

## 2.3. K-means Slider and Metrics

Okay, so now our app has a radio button, but that's only one thing for a viewer to interact with. Buttons are fun, but what if we made a slider to help people see what it means for the number of clusters to change? Let's do it!

      Again, start by adding some objects to the layout.

[ ]:
VimeoVideo("715725482", h="88aa75b1e2", width=600)
      **Task 6.4.10:** Add two text objects to your application's layout: an `H2` header that reads `"K-means Clustering"` and an `H3` header that reads `"Number of Clusters (k)"`. 


Now add the slider.

[ ]:
VimeoVideo("715725430", h="5d24607b0c", width=600)
      **Task 6.4.11:** Add a slider to your application's layout. It should range from `2` to `12`. Be sure to give it the id `"k-slider"`.


And add the whole thing to the app.

[ ]:
VimeoVideo("715725405", h="8944b9c674", width=600)
**Task 6.4.12:** Add a `Div` object to your application's layout. Be sure to give it the id `"metrics"`.

So now we have a bar chart that changes with a radio button, and a slider that changes... well, nothing yet. Let's give it a model to work with.

[ ]:
VimeoVideo("715725235", h="55229ebf88", width=600)
**Task 6.4.13:** Create a `get_model_metrics` function that builds, trains, and evaluates a `KMeans` model. Use the docstring for guidance. Note that, like the model you made in the last lesson, your model here should be a pipeline that includes a `StandardScaler`. Once you're done, submit your function to the grader.

[10]:
def get_model_metrics(trimmed=True, k=2, return_metrics=False):
    """Build ``KMeans`` model based on five highest-variance features in ``df``.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.

    return_metrics : bool, default=False
        If ``False``, returns ``KMeans`` model. If ``True``, returns ``dict``
        with inertia and silhouette score.
    """
    # Get high-variance feature names
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    # Create feature matrix
    X = df[features]
    # Build and train model
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    model.fit(X)

    if return_metrics:
        # Calculate inertia
        i = model.named_steps["kmeans"].inertia_
        # Calculate silhouette score
        ss = silhouette_score(X, model.named_steps["kmeans"].labels_)
        # Put results into dictionary
        metrics = {
            "inertia": round(i),
            "silhouette": round(ss, 3)
        }
        return metrics
    return model
      [11]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.4.13", get_model_metrics())

      Good work!

      Score: 1

Part of what we want people to be able to do with the dashboard is see how the model's inertia and silhouette score change when they move the slider around, so let's calculate those numbers...
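As a refresher on what those two numbers measure, here's a small self-contained sketch (toy blobs, not the SCF data) that computes both metrics for a scaled k-means pipeline with the same structure as the one in this lesson:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: three well-separated clusters in five dimensions
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)

# Same shape as the lesson's model: scaler + k-means
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
model.fit(X_demo)

# Inertia: sum of squared distances to the nearest centroid (lower = tighter clusters)
inertia = model.named_steps["kmeans"].inertia_
# Silhouette score: cohesion vs. separation, in [-1, 1] (higher = better-separated)
ss = silhouette_score(X_demo, model.named_steps["kmeans"].labels_)
print(round(inertia), round(ss, 3))
```

Inertia always shrinks as `k` grows, while the silhouette score can rise or fall, which is exactly why the dashboard shows both.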

      [ ]:
      VimeoVideo("715725137", h="124312b155", width=600)
      **Task 6.4.14:** Create a `serve_metrics` function. It should use your `get_model_metrics` to build and get the metrics for a model, and then return two objects: An `H3` header with the model's inertia and another `H3` header with the silhouette score.


      [12]:
@app.callback(
    Output("metrics", "children"),
    Input("trimmed-button", "value"),
    Input("k-slider", "value")
)
def serve_metrics(trimmed=True, k=2):

    """Returns list of ``H3`` elements containing inertia and silhouette score
    for ``KMeans`` model.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.
    """
    # Get metrics
    metrics = get_model_metrics(trimmed=trimmed, k=k, return_metrics=True)
    # Build H3 elements
    text = [
        html.H3(f"Inertia: {metrics['inertia']}"),
        html.H3(f"Silhouette Score: {metrics['silhouette']}")
    ]
    return text
... and add them to the app.

      [ ]:
      VimeoVideo("715726075", h="ee0510063c", width=600)
      [ ]:
      serve_metrics()
**Task 6.4.15:** Add a callback decorator to your `serve_metrics` function. The callback inputs should be the values returned by `"trimmed-button"` and `"k-slider"`, and the output should be directed to `"metrics"`.


## PCA Scatter Plot

We just made a slider that can change the inertia and silhouette scores, but not everyone will be able to understand what those changing numbers mean. Let's make a scatter plot to help them along.

      [13]:
      VimeoVideo("715726033", h="a658095771", width=600)
      **Task 6.4.16:** Add a `Graph` object to your application's layout. Be sure to give it the id `"pca-scatter"`.


Just like with the bar chart, we need to get the five highest-variance features of the data, so let's start with that.
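Selecting the highest-variance columns is a one-liner in pandas; here's a minimal, self-contained sketch on toy data (the column names are made up for illustration):

```python
import pandas as pd

# Toy DataFrame with very different spreads per column
toy = pd.DataFrame({
    "a": [1, 2, 1, 2],
    "b": [10, 200, 30, 400],
    "c": [0, 0, 0, 1],
})

# Sort columns by variance (ascending) and keep the names of the top two
top_features = toy.var().sort_values().tail(2).index.tolist()
print(top_features)  # → ['a', 'b']
```

`get_high_var_features` does the same thing on `df`, with the extra option of using a trimmed variance so a handful of extreme households don't dominate the ranking.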

      [14]:
      VimeoVideo("715725930", h="f957d27741", width=600)
**Task 6.4.17:** Create a function `get_pca_labels` that subsets a DataFrame to its five highest-variance features, reduces those features to two dimensions using `PCA`, and returns a new DataFrame with three columns: `"PC1"`, `"PC2"`, and `"labels"`. This last column should be the labels determined by a `KMeans` model. Your function should use `get_high_var_features` and `get_model_metrics` as helpers. Refer to the docstring for guidance.


      [16]:
def get_pca_labels(trimmed=True, k=2):

    """Build DataFrame with 2D PCA representation of ``df`` and
    ``KMeans`` labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.
    """
    # Subset data to the five highest-variance features
    features = get_high_var_features(trimmed=trimmed, return_feat_names=True)
    X = df[features]

    # Reduce data to two dimensions
    transformer = PCA(n_components=2, random_state=42)
    X_t = transformer.fit_transform(X)
    X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])

    # Add KMeans labels
    model = get_model_metrics(trimmed=trimmed, k=k, return_metrics=False)
    X_pca["labels"] = model.named_steps["kmeans"].labels_.astype(str)
    X_pca.sort_values("labels", inplace=True)
    return X_pca
      [17]:
      get_pca_labels()
      [17]:
      PC1 PC2 labels
      2208 889749.557584 467355.407904 0
      1056 649765.113978 174994.130637 0
      1057 649536.017166 176269.044416 0
      1058 649536.017166 176269.044416 0
      1059 649765.113978 174994.130637 0
      ... ... ... ...
      1570 -229796.419844 -14301.836873 1
      1571 -229805.583716 -14250.840322 1
      1572 -229814.747589 -14199.843771 1
      1611 -213724.571420 -39060.460885 1
      4417 334191.956229 -186450.064242 1

      4418 rows × 3 columns

Now we can use those five features to make the actual scatter plot.

      [18]:
      VimeoVideo("715725877", h="21365c862f", width=600)
      **Task 6.4.18:** Create a function `serve_scatter_plot` that creates a 2D scatter plot of the data used to train a `KMeans` model, along with color-coded clusters. Use `get_pca_labels` as a helper. Refer to the docstring for guidance. 


      [30]:
@app.callback(
    Output("pca-scatter", "figure"),
    Input("trimmed-button", "value"),
    Input("k-slider", "value")
)
def serve_scatter_plot(trimmed=True, k=2):

    """Build 2D scatter plot of ``df`` with ``KMeans`` labels.

    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.

    k : int, default=2
        Number of clusters.
    """
    fig = px.scatter(
        data_frame=get_pca_labels(trimmed=trimmed, k=k),
        x="PC1",
        y="PC2",
        color="labels",
        title="PCA representation of clusters"
    )
    fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")

    return fig
Again, we finish up by adding some code to make the interactive elements of our app actually work.

      [25]:
      VimeoVideo("715725777", h="4b3ecacb85", width=600)
**Task 6.4.19:** Add a callback decorator to your `serve_scatter_plot` function. The callback inputs should be the values returned by `"trimmed-button"` and `"k-slider"`, and the output should be directed to `"pca-scatter"`.


## Application Deployment

Once you're feeling good about all the work we just did, run the cell and watch the app come to life!

**Task 6.4.20:** Run the cell below to deploy your application. 😎

<div class="alert alert-block alert-info">
Note: We're going to build the layout for our application iteratively. So even though this is the last task, you'll run this cell multiple times as you add features to your application.
</div>
<div class="alert alert-block alert-warning">

Warning: If you have issues with your app launching during this project, try restarting your kernel and re-running the notebook from the beginning. Go to Kernel > Restart Kernel and Clear All Outputs.

If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the "Overview" section of the WQU learning platform.

</div>

      [28]:
      app.run_server(host="0.0.0.0", mode="external")
      Dash app running on https://vm.wqu.edu/proxy/8050/
      
      ---

      Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


      Usage Guidelines

      This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

      This means:

      • ⓧ No downloading this notebook.
      • ⓧ No re-sharing of this notebook with friends or colleagues.
      • ⓧ No downloading the embedded videos in this notebook.
      • ⓧ No re-sharing embedded videos with friends or colleagues.
      • ⓧ No adding this notebook to public or private repositories.
      • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>6.5. Small Business Owners in the United States 🇺🇸</strong></font>

In this assignment, you're going to focus on business owners in the United States. You'll start by examining some demographic characteristics of the group, such as age, income category, and debt vs home value. Then you'll select high-variance features, and create a clustering model to divide small business owners into subgroups. Finally, you'll create some visualizations to highlight the differences between these subgroups. Good luck! 🍀

      [1]:
      wqet_grader.init("Project 6 Assessment")
      [2]:
# Libraries used throughout this assignment
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
import wqet_grader
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Prepare Data

## Import

Let's start by bringing our data into the assignment.

**Task 6.5.1:** Read the file `"data/SCFP2019.csv.gz"` into the DataFrame `df`.

      [3]:
      df = pd.read_csv("data/SCFP2019.csv.gz")
      df shape: (28885, 351)
      
      [3]:
      YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
      0 1 11 6119.779308 2 75 6 12 4 2 0 ... 5 3 6 3 2 10 6 6 3 3
      1 1 12 4712.374912 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
      2 1 13 5145.224455 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
      3 1 14 5297.663412 2 75 6 12 4 2 0 ... 5 2 6 2 1 10 4 4 2 2
      4 1 15 4761.812371 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2

      5 rows × 351 columns

      [4]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.1", list(df.shape))

      You got it. Dance party time! 🕺💃🕺💃

      Score: 1

## Explore

As mentioned at the start of this assignment, you're focusing on business owners. But what percentage of the respondents in `df` are business owners?

**Task 6.5.2:** Calculate the proportion of respondents in `df` that are business owners, and assign the result to the variable `prop_biz_owners`. You'll need to review the documentation regarding the `"HBUS"` column to complete this task.

      [5]:
prop_biz_owners = (df["HBUS"] == 1).mean()
print("proportion of business owners in df:", prop_biz_owners)
      proportion of business owners in df: 0.2740176562229531
      
      [6]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.2", [prop_biz_owners])

      Python master 😁

      Score: 1

Is the distribution of income different for business owners and non-business owners?

**Task 6.5.3:** Create a DataFrame `df_inccat` that shows the normalized frequency for income categories for business owners and non-business owners. Your final DataFrame should look something like this:

          HBUS   INCCAT  frequency
      0      0     0-20   0.210348
      1      0  21-39.9   0.198140
      ...
      11     1     0-20   0.041188
      
      [7]:
df["INCCAT"].replace(inccat_dict)
      [7]:
      HBUS INCCAT frequency
      0 0 0-20 0.210348
      1 0 21-39.9 0.198140
      2 0 40-59.9 0.189080
      3 0 60-79.9 0.186600
      4 0 90-100 0.117167
      5 0 80-89.9 0.098665
      6 1 90-100 0.629438
      7 1 60-79.9 0.119015
      8 1 80-89.9 0.097410
      9 1 40-59.9 0.071510
      10 1 21-39.9 0.041440
      11 1 0-20 0.041188
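The cell above preserves only a fragment of the computation. A self-contained sketch of the general pattern, using toy data and a hypothetical code-to-label mapping (not the real `inccat_dict` or the SCF data):

```python
import pandas as pd

# Toy data: a group flag and a coded category, mimicking HBUS and INCCAT
df_toy = pd.DataFrame({
    "HBUS": [0, 0, 0, 1, 1, 1],
    "INCCAT": [1, 1, 2, 2, 2, 1],
})
cat_labels = {1: "0-20", 2: "21-39.9"}  # hypothetical mapping for illustration

# Normalized frequency of each category within each group
freq = (
    df_toy["INCCAT"]
    .replace(cat_labels)
    .groupby(df_toy["HBUS"])
    .value_counts(normalize=True)
    .rename("frequency")
    .reset_index()
)
print(freq)
```

The key move is `value_counts(normalize=True)` inside a groupby, which makes each group's frequencies sum to 1 so the two populations are comparable despite their different sizes.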
      [8]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.3", df_inccat)

      Yes! Your hard work is paying off.

      Score: 1

**Task 6.5.4:** Using seaborn, create a side-by-side bar chart of `df_inccat`. Set `hue` to `"HBUS"`, and make sure that the income categories are in the correct order along the x-axis. Label the x-axis `"Income Category"`, the y-axis `"Frequency (%)"`, and use the title `"Income Distribution: Business Owners vs. Non-Business Owners"`.

      [9]:
sns.barplot(x="INCCAT", y="frequency", hue="HBUS", data=df_inccat, order=inccat_dict.values())
      [10]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.4", file)

      That's the right answer. Keep it up!

      Score: 1

We looked at the relationship between home value and household debt in the context of the credit fearful, but what about business owners? Are there notable differences between business owners and non-business owners?

**Task 6.5.5:** Using seaborn, create a scatter plot that shows `"HOUSES"` vs. `"DEBT"`. You should color the datapoints according to business ownership. Be sure to label the x-axis `"Household Debt"`, the y-axis `"Home Value"`, and use the title `"Home Value vs. Household Debt"`.

      [11]:
sns.scatterplot(x=df["DEBT"], y=df["HOUSES"], hue=df["HBUS"])
For the model building part of the assignment, you're going to focus on small business owners, defined as respondents who have a business and whose income does not exceed \$500,000.

      [12]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.5", file)

      Yes! Your hard work is paying off.

      Score: 1

**Task 6.5.6:** Create a new DataFrame `df_small_biz` that contains only business owners whose income is below \$500,000.

      [13]:
mask = (df["HBUS"] == 1) & (df["INCOME"] < 500_000)
df_small_biz = df[mask]
      df_small_biz shape: (4364, 351)
      
      [13]:
      YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
      80 17 171 7802.265717 1 62 4 12 4 1 0 ... 3 5 5 5 2 7 9 9 4 4
      81 17 172 8247.536301 1 62 4 12 4 1 0 ... 3 5 5 5 2 7 9 9 4 4
      82 17 173 8169.562719 1 62 4 12 4 1 0 ... 3 5 5 5 2 7 9 9 4 4
      83 17 174 8087.704517 1 62 4 12 4 1 0 ... 3 5 5 5 2 7 9 9 4 4
      84 17 175 8276.510048 1 62 4 12 4 1 0 ... 3 5 5 5 2 7 9 9 4 4

      5 rows × 351 columns

      [14]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.6", list(df_small_biz.shape))

      Yes! Keep on rockin'. 🎸That's right.

      Score: 1

We saw that credit-fearful respondents were relatively young. Is the same true for small business owners?

**Task 6.5.7:** Create a histogram from the `"AGE"` column in `df_small_biz` with 10 bins. Be sure to label the x-axis `"Age"`, the y-axis `"Frequency (count)"`, and use the title `"Small Business Owners: Age Distribution"`.

      [15]:
plt.hist(df_small_biz["AGE"], bins=10)
plt.title("Small Business Owners: Age Distribution")
So, can we say the same thing about small business owners as we can about credit-fearful people?

      [16]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.7", file)

      Your submission doesn't match the expected result. Check the image below to see where your plot differs from the answer.

      Score: 0

Let's take a look at the variance in the dataset.

**Task 6.5.8:** Calculate the variance for all the features in `df_small_biz`, and create a Series `top_ten_var` with the 10 features with the largest variance.

      [17]:
      top_ten_var = df_small_biz.var().sort_values().tail(10)
      [17]:
      EQUITY      1.005088e+13
      FIN         2.103228e+13
      KGBUS       5.025210e+13
      ACTBUS      5.405021e+13
      BUS         5.606717e+13
      KGTOTAL     6.120760e+13
      NHNFIN      7.363197e+13
      NFIN        9.244074e+13
      NETWORTH    1.424450e+14
      ASSET       1.520071e+14
      dtype: float64
      [18]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.8", top_ten_var)

      Party time! 🎉🎉🎉

      Score: 1

We'll need to remove some outliers to avoid problems in our calculations, so let's trim them out.
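Trimming clips off the extreme tails before measuring spread. Here's a quick self-contained sketch with `scipy.stats.mstats.trimmed_var` (the same `limits=(0.1, 0.1)` as the task, on toy numbers):

```python
import numpy as np
from scipy.stats.mstats import trimmed_var

# Nine ordinary values plus one extreme outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1_000])

full_var = np.var(data)
# Drop the bottom and top 10% of observations before computing variance
trim_var = trimmed_var(data, limits=(0.1, 0.1))

print(full_var > trim_var)  # → True
```

A single wealthy household can inflate an untrimmed variance by orders of magnitude, which is why the trimmed ranking is a better guide for feature selection here.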

**Task 6.5.9:** Calculate the trimmed variance for the features in `df_small_biz`. Your calculations should not include the top and bottom 10% of observations. Then create a Series `top_ten_trim_var` with the 10 features with the largest variance.

      [19]:
top_ten_trim_var = df_small_biz.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
      [19]:
      EQUITY      1.177020e+11
      KGBUS       1.838163e+11
      FIN         3.588855e+11
      KGTOTAL     5.367878e+11
      ACTBUS      5.441806e+11
      BUS         6.531708e+11
      NHNFIN      1.109187e+12
      NFIN        1.792707e+12
      NETWORTH    3.726356e+12
      ASSET       3.990101e+12
      dtype: float64
      [20]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.9", top_ten_trim_var)

      Very impressive.

      Score: 1

Let's do a quick visualization of those values.

**Task 6.5.10:** Use plotly express to create a horizontal bar chart of `top_ten_trim_var`. Be sure to label your x-axis `"Trimmed Variance [$]"`, the y-axis `"Feature"`, and use the title `"Small Business Owners: High Variance Features"`.

      [21]:
fig = px.bar(x=top_ten_trim_var, y=top_ten_trim_var.index, title="Small Business Owners: High Variance Features")
fig.update_layout(xaxis_title="Trimmed Variance [$]", yaxis_title="Feature")
      [22]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.10", file)

      Python master 😁

      Score: 1

Based on this graph, which five features have the highest variance?

**Task 6.5.11:** Generate a list `high_var_cols` with the column names of the five features with the highest trimmed variance.

      [23]:
      high_var_cols = top_ten_trim_var.tail(5).index.to_list()
      [23]:
      ['BUS', 'NHNFIN', 'NFIN', 'NETWORTH', 'ASSET']
      [24]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.11", high_var_cols)

      Yes! Your hard work is paying off.

      Score: 1

## Split

Let's turn that list into a feature matrix.

**Task 6.5.12:** Create the feature matrix `X`. It should contain the five columns in `high_var_cols`.

      [25]:
      X = df_small_biz[high_var_cols]
      X shape: (4364, 5)
      
      [26]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.12", list(X.shape))

      You're making this look easy. 😉

      Score: 1

# Build Model

Now that our data is in order, let's get to work on the model.

## Iterate

**Task 6.5.13:** Use a `for` loop to build and train a K-Means model where `n_clusters` ranges from 2 to 12 (inclusive). Your model should include a `StandardScaler`. Each time a model is trained, calculate the inertia and add it to the list `inertia_errors`, then calculate the silhouette score and add it to the list `silhouette_scores`.

<div class="alert alert-info" role="alert">
Note: For reproducibility, make sure you set the random state for your model to 42.
</div>
      [27]:
      # Add `for` loop to train model and calculate inertia, silhouette score.
      Inertia: [5765.863949365048, 3070.4294488357455, 2220.292185089684, 1777.4635570665569, 1443.7860071034045, 1173.3701169574997, 1004.0082329287382, 892.7197264630449, 780.7646441851751, 678.9317940468646, 601.0107062352758]
      
      Silhouette Scores: [0.9542706303253067, 0.8446503900103915, 0.7422220122162623]
      
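The training cell above survives only as a comment, so here's a self-contained sketch of what that loop generally looks like, run on synthetic blobs (`X_demo` is a stand-in; the assignment version would fit on `X` instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix X
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

for k in n_clusters:
    # Build and train model with k clusters
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    model.fit(X_demo)
    # Record inertia
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Record silhouette score
    silhouette_scores.append(
        silhouette_score(X_demo, model.named_steps["kmeans"].labels_)
    )

print(len(inertia_errors), len(silhouette_scores))  # → 11 11
```

Note that `range(2, 13)` yields eleven values of `k`, which matches the eleven inertia values printed above.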
      [28]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.13", list(inertia_errors))

      Very impressive.

      Score: 1

Just like we did in the previous module, we can start to figure out how many clusters we'll need with a line plot based on inertia.

**Task 6.5.14:** Use plotly express to create a line plot that shows the values of `inertia_errors` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Inertia"`, and use the title `"K-Means Model: Inertia vs Number of Clusters"`.

      [31]:
fig = px.line(x=n_clusters, y=inertia_errors, title="K-Means Model: Inertia vs Number of Clusters")
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")
      [32]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.14", file)

      Yes! Keep on rockin'. 🎸That's right.

      Score: 1

And let's do the same thing with our silhouette scores.

**Task 6.5.15:** Use plotly express to create a line plot that shows the values of `silhouette_scores` as a function of `n_clusters`. Be sure to label your x-axis `"Number of Clusters"`, your y-axis `"Silhouette Score"`, and use the title `"K-Means Model: Silhouette Score vs Number of Clusters"`.

      [33]:
fig = px.line(x=n_clusters, y=silhouette_scores, title="K-Means Model: Silhouette Score vs Number of Clusters")
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Silhouette Score")
      [34]:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.15", file)

      🥳

      Score: 1

How many clusters should we use? When you've made a decision about that, it's time to build the final model.

**Task 6.5.16:** Build and train a new k-means model named `final_model`. The number of clusters should be `3`.

<div class="alert alert-info" role="alert">
Note: For reproducibility, make sure you set the random state for your model to 42.
</div>
      [35]:
final_model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
final_model.fit(X)
      [35]:
      Pipeline(steps=[('standardscaler', StandardScaler()),
                      ('kmeans', KMeans(n_clusters=3, random_state=42))])
      [36]:
      # match_steps, match_hyperparameters, prune_hyperparameters should all be True

      Correct.

      Score: 1

# Communicate

Excellent! Let's share our work!

**Task 6.5.17:** Create a DataFrame `xgb` that contains the mean values of the features in `X` for the 3 clusters in your `final_model`.

      [95]:
labels = final_model.named_steps["kmeans"].labels_
xgb = X.groupby(labels).mean()
      [95]:
      BUS NHNFIN NFIN NETWORTH ASSET
      0 736718 1002199 1487967 2076002 2281249
      1 68744792 82021152 91696521 113484264 116752862
      2 12161517 15676186 18291227 23100241 24226024
      [50]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.17", xgb)

      Wow, you're making great progress.

      Score: 1

As usual, let's make a visualization with the DataFrame.

**Task 6.5.18:** Use plotly express to create a side-by-side bar chart from `xgb` that shows the mean of the features in `X` for each of the clusters in your `final_model`. Be sure to label the x-axis `"Cluster"`, the y-axis `"Value [$]"`, and use the title `"Small Business Owner Finances by Cluster"`.

      [101]:
      fig.write_image("images/6-5-18.png", scale=1, height=500, width=700)
      [99]:
          wqet_grader.grade("Project 6 Assessment", "Task 6.5.18", file)

      Python master 😁

      Score: 1

Remember what we did with higher-dimension data last time? Let's do the same thing here.

**Task 6.5.19:** Create a `PCA` transformer, use it to reduce the dimensionality of the data in `X` to 2, and then put the transformed data into a DataFrame named `X_pca`. The columns of `X_pca` should be named `"PC1"` and `"PC2"`.

      [102]:
      X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
      X_pca shape: (4364, 2)
      
      [102]:
      PC1 PC2
      0 -6.220648e+06 -503841.638840
      1 -6.222523e+06 -503941.888901
      2 -6.220648e+06 -503841.638839
      3 -6.224927e+06 -504491.429465
      4 -6.221994e+06 -503492.598399
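The PCA step can be sketched like this, with a small hypothetical `X` standing in for the lesson's survey features:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical features; in the lesson, X holds the SCF columns
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [0.5, 0.4, 0.9, 1.1],
})

# Project the data onto its first two principal components
pca = PCA(n_components=2, random_state=42)
X_t = pca.fit_transform(X)  # array of shape (n_samples, 2)

# Wrap the transformed array in a DataFrame with the required column names
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
print("X_pca shape:", X_pca.shape)
```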
      [103]:
      wqet_grader.grade("Project 6 Assessment", "Task 6.5.19", X_pca)

      That's the right answer. Keep it up!

      Score: 1

Finally, let's make a visualization of our final DataFrame.

**Task 6.5.20:** Use plotly express to create a scatter plot of `X_pca`. Be sure to color the data points using the labels generated by your `final_model`. Label the x-axis `"PC1"`, the y-axis `"PC2"`, and use the title `"PCA Representation of Clusters"`.

      [ ]:
      fig.write_image("images/6-5-20.png", scale=1, height=500, width=700)
      [ ]:
          wqet_grader.grade("Project 6 Assessment", "Task 6.5.20", file)
      ---

      Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


      Usage Guidelines

      This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

      This means:

      • ⓧ No downloading this notebook.
      • ⓧ No re-sharing of this notebook with friends or colleagues.
      • ⓧ No downloading the embedded videos in this notebook.
      • ⓧ No re-sharing embedded videos with friends or colleagues.
      • ⓧ No adding this notebook to public or private repositories.
      • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>6.6. Data Dictionary</strong></font>

# About the Survey of Consumer Finances

From the [US Federal Reserve](https://www.federalreserve.gov/econres/aboutscf.htm) website:

> The Survey of Consumer Finances (SCF) is normally a triennial cross-sectional survey of U.S. families. The survey data include information on families' balance sheets, pensions, income, and demographic characteristics. Information is also included from related surveys of pension providers and the earlier such surveys conducted by the Federal Reserve Board. No other study for the country collects comparable information. Data from the SCF are widely used, from analysis at the Federal Reserve and other branches of government to scholarly work at the major economic research centers.

# SCF Combined Extract Data

      Feature Description
      ACTBUS Total value of actively managed business(es), 2019 dollars
      AGE Age of reference person
      AGECL Age group of the reference person
      ANNUIT Amount R would receive if they cashed in annuities, 2019 dollars
      ANYPEN Pension exists for either reference person or spouse
      ASSET Total value of assets held by household, 2019 dollars
      ASSETCAT Asset percentile groups
      BCALL Information used for borrowing decisions
      BDONT Information used for borrowing decisions
      BFINPLAN Information used for borrowing decisions
      BFINPRO Information used for borrowing decisions
      BFRIENDWORK Information used for borrowing decisions
      BINTERNET Information used for borrowing decisions
      BMAGZNEWS Information used for borrowing decisions
      BMAILADTV Information used for borrowing decisions
      BNKRUPLAST5 Household has declared bankruptcy in the past 5 years
      BOND Total value of directly held bonds held by household, 2019 dollars
      BOTHER Information used for borrowing decisions
      BPLANCJ Either reference person or spouse/partner has both types of pension plan on a current job
      BSELF Information used for borrowing decisions
      BSHOPGRDL Shopping for borrowing and credit
      BSHOPMODR Shopping for borrowing and credit
      BSHOPNONE Shopping for borrowing and credit
      BUS Total value of business(es) in which the household has either an active or nonactive interest, 2019 dollars
      BUSSEFARMINC Income from business, sole proprietorship, and farm, 2019 dollars
      BUSVEH Household has vehicle(s) owned by business
      CALL Total value of call accounts held by household, 2019 dollars
      CANTMANG Why no checking account
      CASEID Case ID (numeric)
      CASHLI Total cash value of whole life insurance held by household, 2019 dollars
      CCBAL Total value of credit card balances held by household, 2019 dollars
      CDS Total value of certificates of deposit held by household, 2019 dollars
      CHECKING Total value of checking accounts held by household, 2019 dollars
      CKCONNECTN Why chose main checking account institution
      CKCONVPAYRL Why chose main checking account institution
      CKLOCATION Why chose main checking account institution
      CKLONGTIME Why chose main checking account institution
      CKLOWFEEBAL Why chose main checking account institution
      CKMANYSVCS Why chose main checking account institution
      CKOTHCHOOSE Why chose main checking account institution
      CKPERSONAL Why chose main checking account institution
      CKRECOMFRND Why chose main checking account institution
      CKSAFETY Why chose main checking account institution
      COMUTF amount in combination and other mutual funds, 2019 dollars
      CONSPAY total monthly consumer debt payments, 2019 dollars
      CPI_DEFL Deflator Value
      CREDIT Why no checking account
      CURRPEN current value in pension, 2019 dollars
      DBPLANCJ Either reference person or spouse/partner has a defined benefit pension on a current job
      DBPLANT Either reference person or spouse/partner has DB plan on current job or some type of pension from a past job to be received in the future
      DCPLANCJ Either reference person or spouse/partner has any type of account-based plan on a current job
      DEBT Total value of debt held by household, 2019 dollars
      DEBT2INC Ratio of total debt to total income
      DEQ Total value of equity in directly held stocks, stock mutual funds, and combination mutual funds held by household, 2019 dollars
      DONTLIKE Why no checking account
      DONTWANT Why no checking account
      DONTWRIT Why no checking account
      EDCL Education category of reference person
      EDN_INST Total value of education loans held by household, 2019 dollars
      EDUC Highest completed grade by reference person
      EHCHKG people w/o checking accounts
      EMERGBORR Respondent would borrow money in a hypothetical financial emergency
      EMERGCUT Respondent would cut back spending in a hypothetical financial emergency
      EMERGPSTP Respondent would postpone payments in a hypothetical financial emergency
      EMERGSAV Respondent would spend out of savings in a hypothetical financial emergency
      EQUITINC ratio of equity to normal income
      EQUITY Total value of financial assets held by household that are invested in stock, 2019 dollars
      EXPENSHILO Households overall expenses over last 12 months
      FAMSTRUCT Family structure of household
      FARMBUS_KG capital gains on farm businesses, 2019 dollars
      FARMBUS compute value of business part of farm net of outstanding mortgages, 2019 dollars
      FEARDENIAL Household feared being denied credit in the past 5 years
      FIN Total value of financial assets held by household, 2019 dollars
      FINLIT Number of financial literacy questions answered correctly
      FOODAWAY Total amount spent on food away from home, annualized, 2019 dollars
      FOODDELV Total amount spent on food delivered to home, annualized, 2019 dollars
      FOODHOME Total amount spent on food at home, annualized, 2019 dollars
      FORECLLAST5 Respondent had a foreclosure in the last five years
      FUTPEN future pensions (accumulated in an account for the R/S), 2019 dollars
      GBMUTF amount in government bond mutual funds, 2019 dollars
      GOVTBND US government and government agency bonds and bills, 2019 dollars
      HBORRALT Respondent would borrow money from alternative sources in a hypothetical financial emergency
      HBORRCC Respondent would borrow money using a credit card in a hypothetical financial emergency
      HBORRFF Respondent would borrow money from friends or family in a hypothetical financial emergency
      HBORRFIN Respondent would borrow money from financial services in a hypothetical financial emergency
      HBROK have a brokerage account
      HBUS Have active or non-actively managed business(es)
      HCUTENT Respondent would postpone payments for entertainment in a hypothetical financial emergency
      HCUTFOOD Respondent would cut back on food purchases in a hypothetical financial emergency
      HCUTOTH Respondent would postpone other payments in a hypothetical financial emergency
      HDEBT Household has any debt
      HELOC_YN Currently borrowing on home equity line of credit
      HELOC Total value of home equity lines of credit secured by the primary residence held by the household, 2019 dollars
      HHSEX Gender of household reference person
      HLIQ Household has any checking, savings, money market or call accounts
      HMORT2 Have junior lien mortgage not used for purchase of primary residence
      HOMEEQ Total value of equity in primary residence of household, 2019 dollars
      HOUSECL Home-ownership category of household
      HOUSES Total value of primary residence of household, 2019 dollars
      HPAYDAY Household had a payday loan within the past year
      HPRIM_MORT Have first lien mortgage on primary residence
      HPSTPLN Respondent would postpone payments on loans in a hypothetical financial emergency
      HPSTPOTH Respondent would postpone other payments in a hypothetical financial emergency
      HPSTPPAY Respondent would postpone payments for purchases in a hypothetical financial emergency
      HSAVFIN Respondent would spend out of financial sources in a hypothetical financial emergency
      HSAVNFIN Respondent would spend out of non-financial sources in a hypothetical financial emergency
      HSEC_MORT Have junior lien mortgage on primary residence
      HSTOCKS have stocks?
      HTRAD traded in the past year
      ICALL Information used for investing decisions
      IDONT Information used for investing decisions
      IFINPLAN Information used for investing decisions
      IFINPRO Information used for investing decisions
      IFRIENDWORK Information used for investing decisions
      IINTERNET Information used for investing decisions
      IMAGZNEWS Information used for investing decisions
      IMAILADTV Information used for investing decisions
      INCCAT Income percentile groups
      INCOME Total amount of income of household, 2019 dollars
      INCPCTLECAT Alternate income percentile groups
      INCQRTCAT Income quartile groups
      INDCAT Industry classifications for reference person
      INSTALL Total value of installment loans held by household, 2019 dollars
      INTDIVINC Interest (taxable and nontaxable) and dividend income, 2019 dollars
      INTERNET Do business with financial institution via the Internet
      IOTHER Information used for investing decisions
      IRAKH Total value of IRA/Keogh accounts, 2019 dollars
      ISELF Information used for investing decisions
      ISHOPGRDL Shopping for saving and investments
      ISHOPMODR Shopping for saving and investments
      ISHOPNONE Shopping for saving and investments
      KGBUS Unrealized capital gains or losses on businesses, 2019 dollars
      KGHOUSE Unrealized capital gains or losses on the primary residence, 2019 dollars
      KGINC Capital gain or loss income, 2019 dollars
      KGORE Unrealized capital gains or losses on other real estate, 2019 dollars
      KGSTMF Unrealized capital gains or losses on stocks and mutual funds, 2019 dollars
      KGTOTAL Total unrealized capital gains or losses for the household, 2019 dollars
      KIDS Total number of children in household
      KNOWL Respondent's knowledge of personal finances
      LATE Household had any late debt payments in last year
      LATE60 Household had any debt payments more than 60 days past due in last year
      LEASE have leased vehicle
      LEVRATIO Ratio of total debt to total assets
      LF Labor force participation of reference person
      LIFECL Life cycle of reference person
      LIQ Total value of all types of transactions accounts, 2019 dollars
      LLOAN1 Total balance of household loans where the lender is a commercial bank, 2019 dollars
      LLOAN10 Total balance of household loans where the lender is a store and/or a credit card, 2019 dollars
      LLOAN11 Total balance of household loans where the lender is a pension, 2019 dollars
      LLOAN12 Total balance of household loans where the lender is other, unclassifiable, or foreign, 2019 dollars
      LLOAN2 Total balance of household loans where the lender is saving and loan, 2019 dollars
      LLOAN3 Total balance of household loans where the lender is credit union, 2019 dollars
      LLOAN4 Total balance of household loans where the lender is finance, loan or leasing company, or inc debt consolidator, 2019 dollars
      LLOAN5 Total balance of household loans where the lender is a brokerage and/or life insurance company, 2019 dollars
      LLOAN6 Total balance of household loans where the lender is a real estate company, 2019 dollars
      LLOAN7 Total balance of household loans where the lender is an individual, 2019 dollars
      LLOAN8 Total balance of household loans where the lender is an other non-financial, 2019 dollars
      LLOAN9 Total balance of household loans where the lender is government, 2019 dollars
      MARRIED Marital status of reference person
      MINBAL Why no checking account
      MMA Total value of money market deposit and money market mutual fund accounts, 2019 dollars
      MMDA money market deposit accounts, 2019 dollars
      MMMF money market mutual funds, 2019 dollars
      MORT1 Amount owed on mortgage 1, 2019 dollars
      MORT2 Amount owed on mortgage 2, 2019 dollars
      MORT3 Amount owed on mortgage 3, 2019 dollars
      MORTBND mortgage-backed bonds, 2019 dollars
      MORTPAY total monthly mortgage payments, 2019 dollars
      MRTHEL Total value of debt secured by the primary residence held by household, 2019 dollars
      NBUSVEH Total number of business vehicles held by household
      NETWORTH Total net worth of household, 2019 dollars
      NEWCAR1 number of car/truck/SUV with model year no older than two years before the survey
      NEWCAR2 number of car/truck/SUV with model year no older than one year before the survey
      NFIN Total value of non-financial assets held by household, 2019 dollars
      NH_MORT Total value of mortgages and home equity loans secured by the primary residence held by household, 2019 dollars
      NHNFIN total non-financial assets excluding principal residences, 2019 dollars
      NINCCAT Normal income percentile groups
      NINCPCTLECAT Alternate Normal income percentile groups
      NINCQRTCAT Normal income quartile groups
      NLEASE number of leased vehicles
      NMMF Total value of directly held pooled investment funds held by household, 2019 dollars
      NNRESRE Total value of net equity in nonresidential real estate held by household, 2019 dollars
      NOCCBAL Household does not carry a balance on credit cards
      NOCHK Household has no checking account
      NOFINRISK Respondent not willing to take financial risk
      NOMONEY Why no checking account
      NONACTBUS Value of non-actively managed business(es), 2019 dollars
      NORMINC Household normal income, 2019 dollars
      NOTXBND tax-exempt bonds (state and local bonds), 2019 dollars
      NOWN number of owned vehicles
      NSTOCKS number different companies in which hold stock
      NTRAD number of trades per year
      NVEHIC total number of vehicles (owned and leased)
      NWCAT Net worth percentile groups
      NWPCTLECAT Alternate net worth percentile groups
      OBMUTF amount in other bond mutual funds, 2019 dollars
      OBND corporate and foreign bonds, 2019 dollars
      OCCAT1 Occupation categories for reference person
      OCCAT2 Occupation classification for reference person
      ODEBT Total value of other debts held by household, 2019 dollars
      OMUTF amount in other mutual funds, 2019 dollars
      ORESRE Total value of other residential real estate held by household, 2019 dollars
      OTH_INST Total value of other installment loans held by household, 2019 dollars
      OTHER Why no checking account
      OTHFIN Total value of other financial assets, 2019 dollars
      OTHLOC Total value of other lines of credit held by household, 2019 dollars
      OTHMA Total value of other managed assets held by household, 2019 dollars
      OTHNFIN Total value of other non-financial assets held by household, 2019 dollars
      OWN have an owned vehicle
      PAYEDU1 payments on first education loan, 2019 dollars
      PAYEDU2 payments on second education loan, 2019 dollars
      PAYEDU3 payments on third education loan, 2019 dollars
      PAYEDU4 payments on fourth education loan, 2019 dollars
      PAYEDU5 payments on fifth education loan, 2019 dollars
      PAYEDU6 payments on sixth education loan, 2019 dollars
      PAYEDU7 payments on seventh education loan, 2019 dollars
      PAYHI1 payments on first home improvement loan, 2019 dollars
      PAYHI2 payments on second home improvement loan, 2019 dollars
      PAYILN1 payments on first installment loan, 2019 dollars
      PAYILN2 payments on second installment loan, 2019 dollars
      PAYILN3 payments on third installment loan, 2019 dollars
      PAYILN4 payments on fourth installment loan, 2019 dollars
      PAYILN5 payments on fifth installment loan, 2019 dollars
      PAYILN6 payments on sixth installment loan, 2019 dollars
      PAYILN7 payments on seventh installment loan, 2019 dollars
      PAYINS payments on loans against insurance policies, 2019 dollars
      PAYLC1 payments on first land contract, 2019 dollars
      PAYLC2 payments on second land contract, 2019 dollars
      PAYLCO payments on other land contracts, 2019 dollars
      PAYLOC1 payments on first line of credit, 2019 dollars
      PAYLOC2 payments on second line of credit, 2019 dollars
      PAYLOC3 payments on third line of credit, 2019 dollars
      PAYLOCO payments on other lines of credit, 2019 dollars
      PAYMARG payments on margin loans, 2019 dollars
      PAYMORT1 payments on first mortgage, 2019 dollars
      PAYMORT2 payments on second mortgage, 2019 dollars
      PAYMORT3 payments on third mortgage, 2019 dollars
      PAYMORTO payments on other loans, 2019 dollars
      PAYORE1 payments on first other residential property, 2019 dollars
      PAYORE2 payments on second other residential property, 2019 dollars
      PAYORE3 payments on third other residential property, 2019 dollars
      PAYOREV payments on remaining other residential properties, 2019 dollars
      PAYPEN1 payments on loan against first pension plan not previously reported, 2019 dollars
      PAYPEN2 payments on loan against second pension plan not previously reported, 2019 dollars
      PAYPEN3 payments on loan against third pension plan not previously reported, 2019 dollars
      PAYPEN4 payments on loan against fourth pension plan not previously reported, 2019 dollars
      PAYPEN5 payments on loan against fifth pension plan not previously reported, 2019 dollars
      PAYPEN6 payments on loan against sixth pension plan not previously reported, 2019 dollars
      PAYVEH1 payments on first vehicle, 2019 dollars
      PAYVEH2 payments on second vehicle, 2019 dollars
      PAYVEH3 payments on third vehicle, 2019 dollars
      PAYVEH4 payments on fourth vehicle, 2019 dollars
      PAYVEHM payments on remaining vehicles, 2019 dollars
      PAYVEO1 payment on first other vehicle, 2019 dollars
      PAYVEO2 payment on second other vehicle, 2019 dollars
      PAYVEOM payment on remaining other vehicles, 2019 dollars
      PENACCTWD Withdrawals from IRAs and tax-deferred pension accounts, 2019 dollars
      PIR40 Household has a PIR higher than 40%
      PIRCONS ratio of monthly non-mortgage non-revolving consumer debt payments to monthly income
      PIRMORT ratio of monthly mortgage payments to monthly income
      PIRREV ratio of monthly revolving debt payments to monthly income
      PIRTOTAL Ratio of monthly debt payments to monthly income
      PLOAN1 Total value of aggregate loan balance by loan purpose
      PLOAN2 Total value of aggregate loan balance by loan purpose
      PLOAN3 Total value of aggregate loan balance by loan purpose
      PLOAN4 Total value of aggregate loan balance by loan purpose
      PLOAN5 Total value of aggregate loan balance by loan purpose
      PLOAN6 Total value of aggregate loan balance by loan purpose
      PLOAN7 Total value of aggregate loan balance by loan purpose
      PLOAN8 Total value of aggregate loan balance by loan purpose
      PREPAID Amount in prepaid card accounts, 2019 dollars
      PURCH1 First lien on primary residence used for purchase of primary residence
      PURCH2 Junior lien on primary residence used for purchase of primary residence
      RACE Race/ethnicity of respondent
      RACECL Class of race of respondent
      RACECL4 Alternate class of race of respondent
      REFIN_EVER Refinanced first lien mortgage on primary residence
      RENT Monthly rent, 2019 dollars
      RESDBT Total value of debt for other residential property held by households, 2019 dollars
      RETEQ Total value of equity in quasi-liquid retirement assets, 2019 dollars
      RETQLIQ Total value of quasi-liquid held by household, 2019 dollars
      REVPAY total monthly revolving debt payments, 2019 dollars
      SAVBND Total value of savings bonds held by household, 2019 dollars
      SAVED Indicator of whether the household saved over the past 12 months
      SAVING Total value of savings accounts held by household, 2019 dollars
      SAVRES1 Reason for saving
      SAVRES2 Reason for saving
      SAVRES3 Reason for saving
      SAVRES4 Reason for saving
      SAVRES5 Reason for saving
      SAVRES6 Reason for saving
      SAVRES7 Reason for saving
      SAVRES8 Reason for saving
      SAVRES9 Reason for saving
      SPENDLESS R would spend less if assets depreciated in value
      SPENDMOR R would spend more if assets appreciated in value
      SSRETINC Social security and pension income, 2019 dollars
      STMUTF amount in stock mutual funds, 2019 dollars
      STOCKS Total value of directly held stocks held by household, 2019 dollars
      SVCCHG Why no checking account
      TFBMUTF amount in tax-free bond mutual funds, 2019 dollars
      THRIFT Total value of account-type pension plans from R and spouse's current job, 2019 dollars
      TPAY Total value of monthly debt payments, 2019 dollars
      TRANSFOTHINC Unemployment, alimony/child support, TANF/food stamps/SSI, and other income, 2019 dollars
      TRUSTS Amount R would receive if they cashed in trusts, 2019 dollars
      TURNDOWN Household has been turned down for credit in the past 5 years
      TURNFEAR Household has been turned down for credit or feared being denied credit in the past 5 years
      VEH_INST Total value of vehicle loans held by household, 2019 dollars
      VEHIC Total value of all vehicles held by household, 2019 dollars
      VLEASE Total value of leased vehicles held by household, 2019 dollars
      WAGEINC Wage and salary income, 2019 dollars
      WGT Sample weight
      WHYNOCKG Reason household does not have a checking account
      WILSH Wilshire index of stock prices
      WSAVED spent more/same/less than income in past year
      X1 Case ID with implicate number
      XX1 Case ID
      Y1 Case ID with implicate number
      YEAR Survey Year
      YESFINRISK Respondent willing to take financial risk
      YY1 Case ID
      ---

      Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

          • Use Theme: JupyterLab Dark
          • Use Theme: JupyterLab Light